Yesterday I translated the answer to "RegEx match open tags except XHTML self-contained tags", with its famous passage:
You cannot parse [X]HTML with regular expressions, because HTML cannot be parsed with regex. Regex is not a tool that can be used to correctly parse HTML. As I have already answered in many HTML-and-regex questions, the use of regex will not allow you to process HTML. Regular expressions are a tool that is not sophisticated enough to understand the constructs used by HTML. HTML is not a regular language and, therefore, cannot be parsed by regular expressions. Regular expressions are not equipped to break HTML down into its meaningful parts.
that ends with a final demonstration of broken HTML:
appears
, the pestilent infection of regex dev will pray your parser of HT ML, your application and your existence forever as a mere Visual Basic or worse < i> he comes not luc hes he comes v̡im̡ie̶ne, ̕h̵u radiance destr going҉ all luminescence, the HTML tags filtering from ̡tu ̸S eyes̸ ̛like lí era I can see it can ͚̖͔̙see e s beautiful or the end extinguishing the lies of men TO DO EŚ͖̩͇̗̪̏̈TÁ ALL LOST EST Á PE RDI DO e l pon̷ and he vie ne he vhe has viene elíco r pe rme a to do M I FACE M I FACE ᵒh two n or o NO NOO̼ O ON Θ for l os án * ̶͑̾̾ ̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆ul͖͉̗̩̳̟̍ͫͥͨ or s ͎ n or are rè̑ͧ̌aͨl̘̝̙ͤ̾̆es ZA̡͊͠͝LGΌ ESͮ҉̯͈͕̹̘ ALȳ̳ Ë͖̉L ͠P̯͍̭O̚ N̐Y̡ Ȩ̬̩̾͛ͪ̈͘L ̶̧̨̹̭̯ͧ̾ͬViENȆ̴̟̟͙̞ͩ͌͝ "
I took its statements for granted:
- HTML cannot be parsed with regex
- regular expressions are not sophisticated enough for this purpose
- HTML is not a regular language and, therefore, cannot be parsed with regular expressions.
But then I received a comment from Mariano:
I know this is a joke that became famous. However, "HTML cannot be parsed with regex" is false. "It's not sophisticated enough" is false. "They are not equipped to dissect HTML" is false. "It is not a regular language and, therefore, cannot be parsed with regular expressions" is flatly false. What is true is that it will give you headaches, because it is not the right tool for that job... I hate this post.
And that left me in doubt. Further searching led me to a Jeff Atwood blog post from 2009, Parsing Html The Cthulhu Way, which opens by discussing the answer I just quoted and the frustration that produced it. However, he then reviews the state of the question and shows that it is not so clear that it cannot be done. He mentions a discussion in which experienced programmers defend the use of regex in certain cases.
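To make my doubt concrete, here is a minimal sketch of what I think both sides are pointing at. The sample strings, the patterns, and the helper names (`open_tags`, `DIV_BODY`, `max_div_depth`) are my own assumptions for illustration; none of them come from the quoted answer or from the blog post. It uses only Python's standard `re` and `html.parser` modules.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample inputs, invented for this illustration.
FLAT = '<p>one</p><br/><a href="x">two</a>'
NESTED = '<div>outer <div>inner</div> tail</div>'

# Opening tags that are not self-closing. This is good enough for simple,
# known markup, but it breaks if an attribute value contains '>'.
OPEN_TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>')

def open_tags(html):
    """Return the names of the non-self-closing opening tags."""
    return OPEN_TAG.findall(html)

# Naive attempt to capture the body of a <div>. Python's `re` has no
# recursion, so the match simply stops at the first closing tag.
DIV_BODY = re.compile(r'<div>(.*?)</div>', re.S)

class DivDepth(HTMLParser):
    """Tracks <div> nesting depth, which the pattern above cannot do."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

def max_div_depth(html):
    """Feed the markup to the parser and report the deepest <div> level."""
    parser = DivDepth()
    parser.feed(html)
    parser.close()
    return parser.max_depth

print(open_tags(FLAT))
print(DIV_BODY.search(NESTED).group(1))
print(max_div_depth(NESTED))
```

On the flat input the pattern finds the opening tags without trouble, which seems to be the kind of limited case that gets defended. On the nested input the naive pattern cuts the outer `<div>` short at the inner closing tag, while the parser keeps track of the nesting.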
So my questions are:
- Can HTML be parsed with regular expressions?
- In what cases is it advisable to do so?
- In what cases is it inadvisable?
You may have noticed that I use the loanword parsear and its native Spanish equivalent interchangeably. I do so because one seems to be the translation of the other, though it is also true that parsear is very widely used in Spanish-speaking environments.