The answer from @nxnev is the right one: you should use an HTML parser. It will spare you problems and make the code more maintainable.
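As a rough sketch of what the parser route could look like, assuming Python and its standard html.parser module: the class DivByClassExtractor and the function extract_divs are names I made up for illustration, and the code assumes reasonably well-nested markup. It tracks nesting depth instead of trying to match tags with a pattern.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class DivByClassExtractor(HTMLParser):
    """Hypothetical helper: collect the markup of each <div> with a given class."""

    def __init__(self, wanted_class):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.wanted_class = wanted_class
        self.depth = 0      # open non-void tags, counting the captured div itself
        self.parts = []     # source fragments of the capture in progress
        self.results = []   # completed captures

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag != "div" or self.wanted_class not in classes:
                return  # not inside a capture and not a div we want
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closed syntax such as <br/>: keep the text, leave depth alone.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_ELEMENTS:
            return
        self.parts.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:  # the outer div just closed
            self.results.append("".join(self.parts))
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if self.depth:
            self.parts.append(f"<!--{data}-->")


def extract_divs(html, wanted_class="aawp"):
    """Return the markup of every <div> whose class list contains wanted_class."""
    parser = DivByClassExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


sample = ('<p>intro</p><div class="aawp"><div style="a > b">inner<br></div>>'
          '</div><div>other</div>')
print(extract_divs(sample))
```

The sample deliberately contains a > inside an attribute value and a stray > in the text, two inputs that are awkward for a regular expression; the parser should treat both as ordinary content.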
If for whatever reason you want to go ahead with a regular expression, keep in mind that it will probably have limitations. In any case, the one I present here should avoid many of these problems:
(Note: this only works with regex engines that support PCRE-style recursion, such as PHP and Perl. JavaScript does not support it, and neither does Python's built-in re module, although the third-party regex package does.)
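A quick way to see whether a given engine accepts the recursion syntax is simply to try compiling a small recursive pattern. The sketch below does that for Python's built-in re module, which rejects the (?1) construct; the function name compiles_with_builtin_re is my own, chosen for illustration.

```python
import re


def compiles_with_builtin_re(pattern_text):
    """Return True if Python's built-in re module can compile the pattern."""
    try:
        re.compile(pattern_text)
    except re.error:
        return False
    return True


# A tiny recursive pattern in PCRE syntax: balanced parentheses.
RECURSIVE = r"(\((?1)*\))"
# An ordinary pattern without recursion, for comparison.
PLAIN = r"<(\w+)[^>]*>"

print(compiles_with_builtin_re(RECURSIVE))
print(compiles_with_builtin_re(PLAIN))
```

If you need the recursive pattern from Python, the third-party regex package is documented to support this kind of group recursion, but I have not tested this exact expression against it.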
Extended version (requires the x flag):
(?=<div[ ]class="aawp">) # The first div must have the class aawp
( # first group (this is the base of the recursion)
#--- Options ---#
# Anything except < and >, one or more times
[^<>]+
# Any void element
| <(?=area|base|br|col|embed|hr
|img|input|link|meta|param|source
|track|wbr)\w+[^>]*>
# HTML comments
| <!-- .*? -->
# Any other tag (may contain nesting)
# Recursion into group 1 with (?1). Group 2 is used to
# close the same tag that was opened
| <(\w+)[^>]*>(?1)*</\2>
)
Demo (with flag 'x')
Compact version (without the x flag):
(?=<div[ ]class="aawp">)([^<>]+|<(?=area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)\w+[^>]*>|<!--.*?-->|<(\w+)[^>]*>(?1)*</\2>)
Demo
Even so, these patterns are not free of limitations. For example, they would not match input like this: <div>hola</di>
or <div style="no deberia tener un mayor >"></div>
or <div>></div>
The first case is easy to solve by replacing the backreference in the closing tag, </\2>, with </[^>]+>, but I preferred to leave it as it was because I did not think it right to treat tags with different names as a matching pair, even if they sit at the same nesting level.
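To make that trade-off concrete, here is a small sketch in Python using only the built-in re module. The two patterns are simplified, non-recursive stand-ins for the last alternative of the expression (a single element whose content holds no further tags), and the names STRICT, LENIENT and matches are mine. The \b after the tag name is an addition of this sketch: it keeps backtracking from shortening the captured name (for example div to di), which would otherwise let the strict form accept the mismatched closing tag.

```python
import re

# Closing tag must repeat the captured opening tag name.
STRICT = re.compile(r"<(\w+)\b[^>]*>[^<>]*</\1>")
# Any closing tag is accepted, whatever its name.
LENIENT = re.compile(r"<(\w+)\b[^>]*>[^<>]*</[^>]+>")


def matches(pattern, text):
    """Return True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for text in ("<div>hola</div>", "<div>hola</di>", "<div>hola</span>"):
    print(text, matches(STRICT, text), matches(LENIENT, text))
```

The lenient form accepts the misspelled closing tag, but it also accepts a completely different one such as </span>, which is the behaviour I wanted to avoid.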