Can HTML parsing be done with regular expressions?

20

Yesterday I translated the answer to "RegEx match open tags except XHTML self-contained tags", with its famous passage:

  

You cannot parse [X]HTML with regular expressions, because HTML cannot be parsed with regex. Regex is not a tool that can be used to parse HTML correctly. As I have already answered in many HTML and regex questions, using regex will not let you process HTML. Regular expressions are a tool that is not sophisticated enough to understand the constructs used by HTML. HTML is not a regular language and, therefore, cannot be parsed with regular expressions. Regular expressions are not equipped to break HTML down into its meaningful parts.

that ends with a final demonstration of broken HTML:

  

appears , the pestilent inf ection of regex dev will pray your parser of HT ML, your application and your existence forever as a mere Visual Basic or worse < i> he comes not luc hes he comes v̡im̡ie̶ne, ̕h̵u radiance destr going҉ all luminescence, the HTML tags filtering from ̡tu ̸S eyes̸ ̛like lí era I can see it can ͚̖͔̙see e s beautiful or the end extinguishing the lies of men TO DO EŚ͖̩͇̗̪̏̈TÁ ALL LOST EST Á PE RDI DO e l pon̷ and he vie ne he v he has vien e el íco r pe rme a to do M I FACE M I FACE ᵒh two n or o NO NOO̼ O ON Θ for l os án * ̶͑̾̾ ̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆ul͖͉̗̩̳̟̍ͫͥͨ or s ͎ n or are rè̑ͧ̌aͨl̘̝̙ͤ̾̆es ZA̡͊͠͝LGΌ ESͮ҉̯͈͕̹̘ ALȳ̳ Ë͖̉L ͠P̯͍̭O̚ N̐Y̡ Ȩ̬̩̾͛ͪ̈͘L ̶̧̨̹̭̯ͧ̾ͬViENȆ̴̟̟͙̞ͩ͌͝ "

I took its statements for granted:

  • HTML cannot be parsed with regex
  • regular expressions are not sophisticated enough for this purpose
  • HTML is not a regular language and, therefore, cannot be parsed with regular expressions.

But then I received a comment from Mariano:

  

I know this is a joke that became famous. However, "HTML cannot be parsed with regex" is false. "It's not sophisticated enough" is false. "They are not equipped to dissect HTML" is false. "It is not a regular language and, therefore, cannot be parsed with regular expressions" is flatly false. What is true is that it will give you headaches, because it is not the right tool for that job ... I hate this post.

And that left me in doubt. Further searching led me to a 2009 blog post by Jeff Atwood, "Parsing Html The Cthulhu Way", which starts by discussing the answer I just quoted and the feeling that prompted it. However, it then reviews the state of the question and shows that it is not so clear that it cannot be done. It mentions a discussion in which experienced programmers defend the use of regex in certain cases.

So my question is:

  • Can you parse HTML with regular expressions?
  • In what cases is it advisable to do so?
  • In what cases is it inadvisable?

You will have noticed that, in the original Spanish, I use analizar and parsear interchangeably. I do so because one seems to be the translation of the other, although it is also true that parsear is very widespread in Spanish-speaking environments.

    
asked by fedorqui 25.07.2017 at 11:11

2 answers

20

The first thing to establish is what we mean by "parsing HTML".

The strict interpretation is to process the whole document, verify that it is correct HTML, work with its full structure, and so on. In that sense, regular expressions are completely insufficient.

The classic example is that of elements that can be nested indefinitely. If I start writing <div><div><div>....<div>Hola mundo</div>....</div></div></div> , there is no regular expression that can verify that I have closed as many div elements as I opened (source: finite automata theory, since counting to an unbounded depth needs unbounded memory).
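The nesting argument can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: `nested_div_regex` and `divs_balanced` are hypothetical helper names, the tags are assumed to be bare `<div>` and `</div>` with no attributes, and "regular expression" is taken in the formal sense (no recursion extensions). A pattern built for a fixed depth rejects anything deeper, while a simple counter, which is exactly the memory a finite automaton lacks, handles any depth.

```python
import re

def nested_div_regex(max_depth: int) -> re.Pattern:
    """Build a regex that accepts <div> nesting up to max_depth levels, no deeper."""
    pattern = r"[^<]*"  # depth 0: plain text only
    for _ in range(max_depth):
        # One more level: text, then any number of <div>...</div> blocks.
        pattern = rf"[^<]*(?:<div>{pattern}</div>[^<]*)*"
    return re.compile(pattern)

def divs_balanced(html: str) -> bool:
    """Counter-based check: every </div> closes a previously opened <div>."""
    depth = 0
    for tag in re.findall(r"</?div>", html):
        depth += 1 if tag == "<div>" else -1
        if depth < 0:  # a closing tag with nothing open
            return False
    return depth == 0

up_to_three = nested_div_regex(3)
three_deep = "<div>" * 3 + "Hola mundo" + "</div>" * 3
four_deep = "<div>" * 4 + "Hola mundo" + "</div>" * 4

print(bool(up_to_three.fullmatch(three_deep)))  # accepted
print(bool(up_to_three.fullmatch(four_deep)))   # rejected: one level too deep
print(divs_balanced(four_deep))                 # the counter copes with any depth
```

Raising `max_depth` only moves the limit; whatever depth the pattern was built for, a document nested one level deeper defeats it.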

Now, this is where someone comes in and says: "But I'm not building a web browser or a grammar analyzer. All I want to know is what is inside the div. I do not care whether all the tags are closed or not; that is the problem of whoever generates the HTML. For me, regular expressions are completely sufficient."

Naturally, if the HTML changes, regular expressions are much more fragile. The problem is not so much that they fail 1 as that they give false positives.

For example, we have our expression to find the content of <div> ( <div>(.*)<\/div> ), and suddenly the page changes to:

 <div>Hola mundo<!-- Whoever reads this is a fool!! --></div>

Oops ... we had better change it to ( <div>([^<]*)< ), right? Fine, until we get:

 <div>Hola <a href="http://micasa.example">mundo</a></div>

Fine, we fix that too (I will stop writing out the regular expression), and the following week we have:

 <!-- <div>Hola mundo</div> I'm not deleting it, just commenting it out because I don't trust SVN. Signed: the newbie -->
 <div>Adios mundo</div>

In all the cases above, the regular expression swallows the error as if nothing had happened, and the process carries on until someone (possibly a human) realizes that the values do not add up, perhaps weeks or months later 2 .
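The silent failures described above can be reproduced with a short Python sketch. The sample strings are modeled on the examples in this answer with shortened comments, `first_div` is a hypothetical helper, and `PATCHED` is just one possible reading of the second attempt (assumed here to stop at the next `<`). In every case the pattern still matches, so nothing signals that the extracted value is wrong.

```python
import re

NAIVE = re.compile(r"<div>(.*)</div>")   # first attempt
PATCHED = re.compile(r"<div>([^<]*)<")   # second attempt: stop at the next "<"

def first_div(pattern, html):
    """Return the first captured <div> content, or None if nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None

with_comment = "<div>Hola mundo<!-- a comment --></div>"
with_link = '<div>Hola <a href="http://micasa.example">mundo</a></div>'
commented_out = "<!-- <div>Hola mundo</div> old version -->\n<div>Adios mundo</div>"

print(repr(first_div(NAIVE, with_comment)))     # the comment leaks into the value
print(repr(first_div(PATCHED, with_link)))      # text after the link is lost
print(repr(first_div(PATCHED, commented_out)))  # stale value from inside the comment
```

None of these calls returns `None` or raises, which is the point: a false positive looks exactly like a correct result to the code that consumes it.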

So:

  

Can you parse HTML with regular expressions?

In general, NO.

  

In what cases is it advisable to do it?

Rather than "advisable", I would say it does not cause too much trouble when:

  • The source of the HTML is under control. It is generated by a program of mine, or by someone in my organization who will let me know when there is going to be a change.

  • Related to the above, we know what structure it will have. If we know that the document will look like this:

    <html><body>
    <ul>
    <li>Punto 1.</li>
    <li>Punto 2.</li>
    ...
    </ul>
    

    and that no other tags, comments, or JavaScript will appear in between, there is no problem 3
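For a controlled document like the one above, a minimal Python sketch could look as follows. The names `DOC`, `ITEM` and `list_items` are hypothetical, and because the sample is cut off after `</ul>`, the closing `</body></html>` is treated as optional, which is an assumption. The whole-document check is there so that an unexpected change fails loudly instead of producing a false positive.

```python
import re

# Whole-document shape: only whitespace and plain-text <li> items are allowed.
DOC = re.compile(
    r"\s*<html><body>\s*<ul>\s*(?:<li>[^<]*</li>\s*)*</ul>\s*(?:</body></html>\s*)?"
)
ITEM = re.compile(r"<li>([^<]*)</li>")

def list_items(html: str) -> list:
    """Return the text of each <li>, or raise if the agreed structure changed."""
    if DOC.fullmatch(html) is None:
        raise ValueError("document no longer has the agreed structure")
    return ITEM.findall(html)

page = """<html><body>
<ul>
<li>Punto 1.</li>
<li>Punto 2.</li>
</ul>
"""

print(list_items(page))
```

A page where someone adds markup inside an item, for example `<li>Punto <b>1</b>.</li>`, no longer matches `DOC` and raises `ValueError` rather than returning partial data.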

  

In what cases is it inadvisable?

All others.
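For those other cases, a parser is the usual alternative. The following is a minimal sketch using Python's standard `html.parser` module; the class `DivText` and the function `div_texts` are hypothetical names, and the code only collects the text of top-level `<div>` elements. It is one possible approach, not the only one.

```python
from html.parser import HTMLParser

class DivText(HTMLParser):
    """Collect the text inside each top-level <div>, ignoring comments."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <div> elements are currently open
        self.divs = []   # text collected for each top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            if self.depth == 1:
                self.divs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text nodes only; comments go to handle_comment, which is not overridden.
        if self.depth > 0:
            self.divs[-1] += data

def div_texts(html: str) -> list:
    parser = DivText()
    parser.feed(html)
    parser.close()
    return parser.divs

print(div_texts('<div>Hola <a href="http://micasa.example">mundo</a></div>'))
print(div_texts("<!-- <div>Hola mundo</div> old version -->\n<div>Adios mundo</div>"))
```

Because the parser tokenizes comments and nested tags separately, the link text is kept and the commented-out `<div>` is ignored, the two situations where the regular expressions above returned wrong values without complaint.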

1 If it fails, the error is handled and the regular expression is adapted accordingly. After all, if the format of the page being parsed changes, programs that use grammar analyzers can run into problems too (although they will always be more flexible).

2 A different kind of problem arises if I want the content of the first div and that content is moved to the third one. But that is unsolvable for both regular expressions and parsers unless the elements use an id ; and if an id is used, what we are looking for is no longer the nth div but the element with the corresponding id .

3 In fact, the subset of HTML defined this way is a regular language, so regular expressions are fully sufficient to analyze it.

    
answered 25.07.2017 at 13:21
-2
  

Can you parse HTML with regular expressions?

Yes, of course (not tested):

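if (!preg_match('#(?<=<)\w+(?=[^<]*?>)#', $string)) {
    // Nothing that looks like a tag: return the input untouched
    return $string;
}

$patterns = array('<b>', '<p>', '<br>'); // etc.: array of tags

// Escape each tag and join them into a regex alternation
$quoted = array_map(function ($tag) { return preg_quote($tag, '/'); }, $patterns);
$patterns_flattened = implode('|', $quoted);

if (preg_match('/' . $patterns_flattened . '/', $string, $matches)) {
    // $matches[0] holds the first listed tag found in $string
}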
  

In what cases is it advisable to do it?

I would recommend it only for checking that the number of opening and closing tags is the same, even if they differ syntactically (this case also needs a check of their order), and for extracting the content (which requires stripping the HTML tags).

  

In what cases is it inadvisable?

When you need to use data from the content.

    
answered 18.08.2017 at 17:19