Delete content with Regular Expressions - Regex


I would like to ask how to delete a block of content with a regular expression. The regular expression I am using cannot tell which closing </div> belongs to the opening tag, so it eats almost all of my content. It seems that regular expressions are unable to work out where the DIV should be cut.
This is the HTML content:

<div class="aawp">
  <div id="aawp-tb-445">
     <div class="aawp-tb aawp-tb--desktop aawp-tb--cols-5 aawp-tb--hide-labels">
         ...
      </div>
   </div>
</div>

I try to extract it with the following regular expression but, as mentioned above, it does not know which closing DIV corresponds to the opening one, so it eats all of the HTML content.

<div class="aawp">(.*?)<\/div>
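
To make the symptom concrete, here is a minimal sketch in Python (an assumption: the question does not name a regex engine, and the sample markup is shortened for illustration). With the dot allowed to match newlines, the lazy form stops at the first </div> it meets, which is the innermost one, while a greedy form runs on to the last </div> in the document:

```python
import re

# Shortened, hypothetical sample: the aawp block followed by unrelated markup.
html = (
    '<div class="aawp">\n'
    '  <div id="aawp-tb-445">\n'
    '    <div class="aawp-tb">...</div>\n'
    '  </div>\n'
    '</div>\n'
    '<p>unrelated content</p>\n'
    '<div class="footer">footer</div>\n'
)

# Lazy: ends at the first closing tag, so the block is cut short.
lazy = re.search(r'<div class="aawp">(.*?)</div>', html, re.DOTALL)

# Greedy: ends at the last closing tag, so unrelated content is swallowed.
greedy = re.search(r'<div class="aawp">(.*)</div>', html, re.DOTALL)

print(lazy.group(0))
print("----")
print(greedy.group(0))
```

Neither variant counts opening and closing tags, which is why neither one lands on the closing tag that actually pairs with the opening one.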
    
asked by Fumatamax 18.08.2018 at 12:11

2 answers


Regular expressions are not the right tool for this kind of task. Applying them to HTML is not wrong in simple, consistent situations, but it is very difficult (and in some implementations impossible) to make a regular expression cope with the full complexity of HTML. If you need to do this kind of task regularly and cannot be sure that the markup style is uniform, you are better off using a real parser. Look for one online, either as a library or as a complete tool, and save yourself the headaches.
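
As an illustration of the parser route, here is a minimal sketch using html.parser from Python's standard library (the choice of Python is an assumption, and AawpStripper and strip_aawp are hypothetical names made up for this example). It re-emits the document while skipping every <div class="aawp"> subtree, counting only div tags to find the matching close, and it assumes the div tags themselves are properly nested:

```python
from html.parser import HTMLParser


class AawpStripper(HTMLParser):
    """Re-emit the input, skipping every <div class="aawp"> subtree."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the document that are kept
        self.depth = 0   # number of open <div>s inside a skipped block

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and "aawp" in (dict(attrs).get("class") or "").split():
            self.depth = 1  # start skipping
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.depth:
            self.out.append(f"<!{decl}>")


def strip_aawp(html):
    parser = AawpStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_aawp(text) returns the markup without the aawp block. Indentation, class order and minification do not matter here, because the parser tracks tags rather than characters.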

That said, I arrived at this regular expression, which seems to do the job for that specific example (using PCRE in GNU grep):

(?<=\n)?(\s*?)<div class="aawp">\n(?:.|\n)*?\n\1<\/div>(?=\n)?

Example:

$ cat file
<div class="foo">
  <div id="aawp-tb-445">
    <div class="aawp-tb aawp-tb--desktop aawp-tb--cols-5 aawp-tb--hide-labels">
      ...
    </div>
  </div>
</div>

<div class="bar">
  <div class="aawp">
    <div id="aawp-tb-445">
      <div class="aawp-tb aawp-tb--desktop aawp-tb--cols-5 aawp-tb--hide-labels">
        ...
      </div>
    </div>
  </div>
</div>

<div class="baz">
  <div id="aawp-tb-445">
    <div class="aawp-tb aawp-tb--desktop aawp-tb--cols-5 aawp-tb--hide-labels">
      ...
    </div>
  </div>
</div>
$ grep -Poz '(?<=\n)?(\s*?)<div class="aawp">\n(?:.|\n)*?\n\1<\/div>(?=\n)?' file
  <div class="aawp">
    <div id="aawp-tb-445">
      <div class="aawp-tb aawp-tb--desktop aawp-tb--cols-5 aawp-tb--hide-labels">
        ...
      </div>
    </div>
  </div>

Explanation:

  • (?<=\n)? : Equivalent to ^ . Look for the preceding line break, if it exists, but do not include it in the result.
  • (\s*?) : Find and capture the indentation for later reference.
  • <div class="aawp"> : The corresponding opening tag.
  • \n : Equivalent to $ .
  • (?:.|\n)*? : Similar to .*? but accepting line breaks.
  • \n : Equivalent to ^ .
  • \1 : Match the same indentation that was captured earlier.
  • <\/div> : The corresponding closing tag.
  • (?=\n)? : Equivalent to $ . Look for the next line break, if it exists, but do not include it in the result.

Keep in mind that, for the reasons given in the first paragraph, this pattern fails in cases such as these:

  • Indent level does not match.
  • The class of the tag changes.
  • The HTML file is minified to save space.
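
The indentation trick and its first and third failure cases can be sketched in Python as follows (an assumption: this is an adaptation to Python's re module, not the grep command above, and remove_block is a hypothetical helper name). The backreference \1 forces the closing tag to sit at the same indentation as the opening one:

```python
import re

# Adapted for Python's re; the original answer uses PCRE through GNU grep.
BLOCK = re.compile(
    r'^([ \t]*)<div class="aawp">\n'  # group 1: indentation of the opening tag
    r'(?:.*\n)*?'                     # as few whole lines as possible
    r'\1</div>\n?',                   # closing tag at the same indentation
    re.MULTILINE,
)


def remove_block(html):
    """Delete every aawp block whose closing tag lines up with its opening tag."""
    return BLOCK.sub("", html)


indented = (
    '<div class="bar">\n'
    '  <div class="aawp">\n'
    '    <div id="aawp-tb-445">\n'
    '      ...\n'
    '    </div>\n'
    '  </div>\n'
    '</div>\n'
)
minified = '<div class="bar"><div class="aawp"><div>...</div></div></div>'

print(remove_block(indented))   # the aawp block is gone
print(remove_block(minified))   # unchanged: there is no indentation to anchor on
```

On the indented input only the outer bar div survives, while the minified input comes back untouched, as the limitations above predict.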
answered 18.08.2018 at 15:22

The answer from @nxnev is right: you should use an HTML parser. You will avoid problems and make the code more maintainable.

If for whatever reason you want to go ahead with a regular expression, keep in mind that you will probably run into limitations. In any case, the one I present here should avoid many of these problems:

(note: this only works in engines that support PCRE-style recursion, such as PHP and Perl; JavaScript, for example, does not, and Python needs the third-party regex module because its built-in re module has no recursion)

Extended version (needs the x flag):

(?=<div[ ]class="aawp">)  # The first div must have the class aawp
(                 # first group (the basis of the recursion)

  #--- Alternatives ---#
  # Anything except < or >, one or more times
    [^<>]+
  # Any void element
  | <(?=area|base|br|col|embed|hr
       |img|input|link|meta|param|source
       |track|wbr)\w+[^>]*>
  # HTML comments
  | <!-- .*? -->
  # Any other tag (may be nested)
  # Recursion into group 1 via (?1). Group 2 is used to
  # close the same tag that was opened
  | <(\w+)[^>]*>(?1)*</\2>
)

Demo (with flag 'x')

Compact version (without the x flag):

(?=<div[ ]class="aawp">)([^<>]+|<(?=area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)\w+[^>]*>|<!--.*?-->|<(\w+)[^>]*>(?1)*</\2>)

Demo

Even so, these patterns are not free of limitations. For example, they would not match something like <div>hello</di>, or <div style="should not contain a greater-than sign >"></div>, or <div>></div>

The first case is easy to fix by replacing </\2> with </[^>]+>, but I preferred to leave it as it is, because it did not seem right to treat tags with different names as a matching pair even when they sit at the same nesting level.
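
The limitation with > inside an attribute value comes from the [^>]* part of the tag sub-pattern, which stops at the first > it sees, quoted or not. A minimal sketch with Python's re module (an assumption; only the opening-tag fragment is tested, since re cannot run the recursive pattern, and first_open_tag is a hypothetical helper name):

```python
import re

# Only the opening-tag fragment of the pattern above; no recursion involved.
OPEN_TAG = re.compile(r'<(\w+)[^>]*>')


def first_open_tag(html):
    """Return the text the fragment treats as the opening tag, or None."""
    m = OPEN_TAG.match(html)
    return m.group(0) if m else None


print(first_open_tag('<div class="x">text</div>'))
print(first_open_tag('<div style="a > b"></div>'))
```

The first call returns the whole opening tag. The second is cut off at the > inside the quoted value, so the rest of the attribute is mistaken for content and the closing tag no longer lines up.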

    
answered 21.08.2018 at 18:40