Regular Expressions in Python

0

I am trying to extract information from some HTML. In particular, I am interested in the content of this tag:

<h2> <a href="https://www.xataka.com/robotica-e-ia/deepfakes-tendremos-problema-verdad-videos-serviran-como-pruebas" class="l:3035280" > Con los deepfakes tendremos un problema con la verdad: ni los vídeos servirán como pruebas </a>  </h2>

I want to keep what is between the h2 tags, so that I can then filter out the part that really interests me: the link text (in English, "With deepfakes we will have a problem with the truth: not even videos will serve as evidence"). I was thinking of doing this with re.search.

I tried the pattern [\s\w]+<\/h2>, intending to capture all the letters and spaces that precede the </h2> tag, but I do not get the result I want.

If someone could help me, I'd appreciate it.

Thank you very much

    
asked by Andy06 on 01.11.2018 at 19:44

3 answers

2

Although the question was tagged regexp, it was also tagged python, so I am going to give a solution that does not use regular expressions (I think Julio's answer already covers the regex case well).

In general, parsing HTML with regular expressions is complex and results in fragile code that is difficult to read and maintain.

It is preferable to use tools specifically designed to parse HTML and navigate the resulting DOM (Document Object Model), such as lxml with XPath. lxml is not part of the Python standard library, so you will have to install it with pip install lxml (preferably inside a virtual environment). Once installed, you can use it like this:

import lxml.html

html="""
<h2> <a href="https://www.xataka.com/robotica-e-ia/deepfakes-tendremos-problema-verdad-videos-serviran-como-pruebas" class="l:3035280" > Con los deepfakes tendremos un problema con la verdad: ni los vídeos servirán como pruebas </a>  </h2>
<p> Este texto está fuera del h2 </p>
<h2> <b><u> Con los Deepfakes... </u></b> </h2>
"""

dom = lxml.html.fromstring(html)
for h2 in dom.xpath("//h2"):
    print("Contenido:", h2.text_content().strip())

And you would get:

Contenido: Con los deepfakes tendremos un problema con la verdad: ni los vídeos servirán como pruebas
Contenido: Con los Deepfakes...

The expression "//h2" passed to xpath() means "any element of type h2 that appears in the document, regardless of how deeply it is nested". It selects every h2 element in the document, together with everything it contains. The text_content() method then returns only the plain text inside each element, with all markup removed.
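To make concrete what "plain text inside each h2" means, here is a minimal sketch that uses only the standard library's html.parser instead of lxml. It is a swapped-in technique, not the method shown above, and the class name H2TextCollector, the sample URL and the sample headings are made up for illustration.

```python
from html.parser import HTMLParser


class H2TextCollector(HTMLParser):
    """Hypothetical helper: collects the plain text inside each <h2> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while we are inside an <h2>
        self.parts = []     # text pieces of the <h2> currently being read
        self.headings = []  # finished headings, whitespace-stripped

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h2" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.headings.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        # Text from nested tags (<a>, <b>, <u>, ...) also arrives here.
        if self.depth:
            self.parts.append(data)


# Made-up sample document, modelled on the one in the answer.
html = """
<h2> <a href="https://example.invalid/post"> Con los deepfakes... </a>  </h2>
<p> Este texto está fuera del h2 </p>
<h2> <b><u> Otro titular </u></b> </h2>
"""

collector = H2TextCollector()
collector.feed(html)
collector.close()
print(collector.headings)
```

The nested tags are skipped and only their text is kept, which is roughly what text_content() does for each h2. For real pages a full parser such as lxml is still the more robust choice.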

Another library often used to parse HTML is Beautiful Soup, which has a perhaps friendlier syntax (although, unlike XPath, it is not a standard). With this library it would look like this:

from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html, "html.parser")
for h2 in soup.find_all("h2"):
    print("Contenido:", h2.get_text().strip())
    
answered 05.11.2018 at 09:30
1

Try the following:

<h2>(?:\s*<[^>]+>)*\s*(.*?)\s*(?:<\/[^>]+>\s*)*<\/h2>

It should work with any tag or tags nested inside <h2>, such as:

  • <h2><a href="..."> Foo Bar </a></h2>
  • <h2><u><b><i> Foo Bar </i></b></u></h2>
  • Etcetera

Capture group 1 will contain the text without the tags.
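As a rough sketch of how this pattern could be used from Python, the snippet below compiles it and reads capture group 1. The helper name h2_text and the sample strings (including the example.invalid URL) are made up for illustration, and it assumes each h2 sits on a single line, since "." does not match newlines by default.

```python
import re

# Pattern from the answer; a raw string keeps the backslashes intact.
H2_TEXT = re.compile(r"<h2>(?:\s*<[^>]+>)*\s*(.*?)\s*(?:<\/[^>]+>\s*)*<\/h2>")

# Made-up samples in the two shapes listed above.
samples = [
    '<h2> <a href="https://example.invalid/post" class="x" > Con los deepfakes </a>  </h2>',
    "<h2><u><b><i> Foo Bar </i></b></u></h2>",
]


def h2_text(fragment):
    """Hypothetical helper: return the text of the first <h2>, or None."""
    match = H2_TEXT.search(fragment)
    return match.group(1) if match else None


results = [h2_text(s) for s in samples]
print(results)
```

Checking for None before calling group(1) avoids an AttributeError when the fragment contains no h2 at all.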


    
answered 04.11.2018 at 01:44
0

If what you want is the text "With deepfakes we will have a problem with the truth: not even videos will serve as evidence", what you really need is the content of the <a> tag (not <h2>).

You can get it like this:

texto = "<h2> <a href=\"https://www.xataka.com/robotica-e-ia/deepfakes-tendremos-problema-verdad-videos-serviran-como-pruebas\" class=\"l:3035280\" > Con los deepfakes tendremos un problema con la verdad: ni los vídeos servirán como pruebas </a>  </h2>"

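import re

# Raw string avoids invalid-escape warnings; [^>]* and .*? keep the match inside a single <a> element
reg = r'<a[^>]*>(.*?)</a>'
m = re.search(reg, texto)

print(m.group(1).strip())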
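One thing to watch with patterns of this kind is greediness: a greedy ".*" can run past the first link when several appear on the same line. The following sketch, with a made-up two-link sample string, compares a greedy pattern with a non-greedy one and also shows re.findall for collecting every link text.

```python
import re

# Made-up sample: two links on the same line.
texto = '<a href="/uno">uno</a> <a href="/dos">dos</a>'

# Greedy: ".*" swallows as much as it can, so the capture lands on the last link.
greedy = re.search(r'<a.*>(.*)</a>', texto).group(1)

# Non-greedy: "[^>]*" stays inside the opening tag and ".*?" stops at the first </a>.
lazy = re.search(r'<a[^>]*>(.*?)</a>', texto).group(1)

# findall returns the captured text of every match.
todos = re.findall(r'<a[^>]*>(.*?)</a>', texto)

print(greedy)  # dos
print(lazy)    # uno
print(todos)   # ['uno', 'dos']
```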


    
answered 01.11.2018 at 20:00