Although the question had the regexp tag, it also had the python tag, so I'm going to give another solution that does not use regular expressions (well, I think Julio's answer already covers the regex case perfectly).
In general, parsing HTML with regular expressions is complex and produces fragile code that is difficult to read and maintain.
It is preferable to use tools specifically designed to parse HTML and traverse the resulting DOM (Document Object Model), such as lxml together with XPath. Unfortunately this library is not part of Python's standard library, so you'll have to install it with pip install lxml (preferably within a virtual environment). Once installed, you can use it, for example, in the following way:
import lxml.html
html="""
<h2> <a href="https://www.xataka.com/robotica-e-ia/deepfakes-tendremos-problema-verdad-videos-serviran-como-pruebas" class="l:3035280" > Con los deepfakes tendremos un problema con la verdad: ni los vídeos servirán como pruebas </a> </h2>
<p> Este texto está fuera del h2 </p>
<h2> <b><u> Con los Deepfakes... </u></b> </h2>
"""
dom = lxml.html.fromstring(html)
for h2 in dom.xpath("//h2"):
    print("Contenido:", h2.text_content().strip())
And you would get:
Contenido: Con los deepfakes tendremos un problema con la verdad: ni los vídeos servirán como pruebas
Contenido: Con los Deepfakes...
The expression "//h2" that I passed to XPath means "any element of type h2 that appears in the document, regardless of its nesting level". That expression selects all the h2 elements of the document, with everything they contain. The text_content() method then extracts the plain text inside those elements, with all markup stripped.
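XPath can also select attributes directly, not just elements, which is handy if you want, say, the link targets inside each heading. A minimal sketch (the URL here is a made-up example, not taken from the question):

```python
import lxml.html

html = '<h2><a href="https://example.com/post" class="l:123">Titular</a></h2>'
dom = lxml.html.fromstring(html)

# "//h2//a/@href" selects the href attribute of every <a> nested in an <h2>
for href in dom.xpath("//h2//a/@href"):
    print(href)  # → https://example.com/post
```
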
Another library often used to parse HTML is BeautifulSoup, which has a perhaps friendlier syntax (although, unlike XPath, it is not a standard). With this library it would look like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # naming the parser avoids a warning
for h2 in soup.find_all("h2"):
    print("Contenido:", h2.get_text().strip())
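BeautifulSoup can reach attributes too, via CSS selectors rather than XPath. A sketch equivalent to the attribute example above (again with a made-up URL):

```python
from bs4 import BeautifulSoup

html = '<h2><a href="https://example.com/post">Titular</a></h2>'
soup = BeautifulSoup(html, "html.parser")

# "h2 a[href]" is a CSS selector: every <a> with an href, inside an <h2>
for a in soup.select("h2 a[href]"):
    print(a["href"])  # → https://example.com/post
```
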