Get text within commented-out tags using BeautifulSoup

2

I'm scraping an HTML document with BeautifulSoup4, and I need to extract text that sits inside HTML comments. For a plain comment like this:

<!-- este es el texto -->

I can get the text (omitting the rest of the code) like this:

texto = soup.find_all(string=lambda text: isinstance(text, Comment))

But what I want is the text inside a commented-out tag, like this:

<!-- <span>texto que quiero</span> -->

Is there any way to do that? The Python code above returns objects from the bs4 library rather than plain strings, so I don't know how to convert a result to a string and call replace on it. I would treat that as a last resort anyway, since I'd like to do as much as possible with bs4's own functionality.

    
asked by Cesar Augusto 10.09.2018 at 11:21

1 answer

5

Once you have found the comment, you can convert it to a string with str().
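As a minimal sketch of that conversion (the sample markup and the variable names here are illustrative assumptions, not taken from the question's real page): bs4's Comment class is a subclass of Python's str, so str() gives you a plain copy and the usual string methods such as strip() and replace() apply to it.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative markup; the real document would come from your scraper.
html = "<p>visible</p><!-- <span>texto que quiero</span> -->"
soup = BeautifulSoup(html, "html.parser")

# Find the first comment node in the document.
comment = soup.find(string=lambda text: isinstance(text, Comment))

# Comment subclasses str, so str() simply yields a plain string copy.
as_text = str(comment)

# Ordinary string methods now apply (the "last resort" replace approach).
cleaned = as_text.strip().replace("<span>", "").replace("</span>", "")
print(cleaned)  # texto que quiero
```

This works for a trivial case, but stripping tags with replace gets fragile quickly, which is why parsing the comment again is usually the cleaner route.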

In your case, the resulting string is itself a fragment of HTML, so you can simply run BeautifulSoup on it again to parse it and look inside the <span> tag, or whatever else it contains.

Demonstration of the idea:

>>> from bs4 import BeautifulSoup, element

>>> doc = """
<html><head><title>Ejemplo</title></head>
<body>
<p class="title"><b>Ejemplo</b></p>

<!-- <span>texto que quiero</span> -->

<p>Texto adicional de relleno que no viene al caso.</p>

<p class="story">...</p>
</body>
</html>
"""

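>>> # Name the parser explicitly to avoid the "no parser specified" warning
>>> soup = BeautifulSoup(doc, "html.parser")
>>> # Collect every comment node in the document
>>> comentarios = soup.find_all(string=lambda text: isinstance(text, element.Comment))
>>> # Turn the first comment into a plain string and parse it as HTML again
>>> primer_comentario = str(comentarios[0])
>>> texto = BeautifulSoup(primer_comentario, "html.parser").span.contents[0]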
>>> texto
'texto que quiero'
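The same idea can be wrapped in a small helper when a page has several comments and only some of them contain markup. This is only a sketch: the function name text_in_commented_tags and the sample document are hypothetical, and it assumes the standard-library "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def text_in_commented_tags(html, tag_name):
    """Return the text of every <tag_name> element found inside HTML comments.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    results = []
    for comment in comments:
        # Re-parse the comment body as its own HTML fragment.
        fragment = BeautifulSoup(str(comment), "html.parser")
        for tag in fragment.find_all(tag_name):
            results.append(tag.get_text())
    return results


# Sample document: one plain comment and one commented-out tag.
doc = """
<body>
<!-- este es el texto -->
<!-- <span>texto que quiero</span> -->
<p>Texto adicional de relleno.</p>
</body>
"""

print(text_in_commented_tags(doc, "span"))  # ['texto que quiero']
```

Comments that contain no matching tag, like the first one in the sample, simply contribute nothing to the result.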
    
answered 10.09.2018 at 12:30