Capture the attribute values of an option tag with regex groups

3

I am using the regular expression <option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option> to match, in the content of a page, the option tags of a select that is created dynamically. I need to capture the value, if there is one, and to know whether the option has the selected attribute.

As an example I have these options:

<option value="">opcion 1</option>
<option value="opcion2check">opcion 2</option>
<option value="" selected="selected">--Ninguno--</option>

With the first 2 options it works correctly. With the last one there is apparently a problem with a space: it does not take the value as null and instead returns " selected="selected as the value of the value attribute. Here is the link where I am testing the expression.

    
asked by XtheblackX 15.11.2017 at 17:41

3 answers

4

That's the problem with trying to process HTML with regex. There is always some exception that will break your pattern, since there is no way to fit all of HTML's syntax into a single regex (in theory it can be done, but not within reason).

The error happens because .+? , although it tries to consume as few characters as possible, must match at least 1 character.

<option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option>
                   ^^^
               here it consumes the text: " selected="selected

So, when it encounters the text

<option value="" selected="selected"
               |
               |
        The match attempt's cursor starts here
        .+? first tries to match 1 character (the closing quote)
        but the rest of the pattern does not match
        So it then tries 2 characters, then 3, then 4, etc.

Until it finally finds a match by consuming:

<option value="" selected="selected">--Ninguno--</option>
               \___________________/
                      .+?
                matches this text


You might think that the outer ? makes the group optional, and it does, but that quantifier is greedy, so it tries to match one repetition first before trying zero.

<option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option>
                  ^^^^^^
               this quantifier first tries to match 1 repetition
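
The backtracking described above can be reproduced with a short script. This is a minimal sketch, assuming a JavaScript engine such as Node.js or a browser console; the names `originalPattern` and `questionOptions` are introduced here only for illustration.

```javascript
// The pattern from the question, unchanged.
const originalPattern = /<option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option>/;

// The three sample options from the question.
const questionOptions = [
  '<option value="">opcion 1</option>',
  '<option value="opcion2check">opcion 2</option>',
  '<option value="" selected="selected">--Ninguno--</option>',
];

for (const html of questionOptions) {
  const m = originalPattern.exec(html);
  // m[2] is the value group, m[3] the selected group, m[4] the label.
  console.log({ value: m[2], selected: m[3], label: m[4] });
}
```

For the third option the value group holds `" selected="selected` and the selected group is undefined, which is the behavior reported in the question.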


The simple solution is to change the + quantifier to * , which also makes the outer ? unnecessary, leaving:

<option\s?(value="(.*?)")?\s?(selected="selected")?>(.*?)<\/option>

You can even simplify the expression a bit by not creating groups that you probably do not use, while also allowing optional whitespace:

<option\s*(?:value\s*=\s*"(.*?)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>

Or, without allowing quotation marks inside the quoted value:

<option\s*(?:value\s*=\s*"([^"]*)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>
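
As a hedged illustration of how the last pattern could be used, the sketch below collects the value, the selected flag and the label of every option in a string. The helper name `parseOptions` and the variable names are hypothetical, and the sample input is the three options from the question.

```javascript
// Last pattern above: group 1 = value, group 2 = selected attribute, group 3 = label.
const optionPattern = /<option\s*(?:value\s*=\s*"([^"]*)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>/g;

// Hypothetical helper: returns one record per <option> found in the text.
function parseOptions(html) {
  return Array.from(html.matchAll(optionPattern), (m) => ({
    value: m[1] ?? null,          // null when the option has no value attribute
    selected: m[2] !== undefined, // true only when selected="selected" is present
    label: m[3],
  }));
}

const sampleOptions = [
  '<option value="">opcion 1</option>',
  '<option value="opcion2check">opcion 2</option>',
  '<option value="" selected="selected">--Ninguno--</option>',
].join("\n");

console.log(parseOptions(sampleOptions));
```

With this pattern an empty value attribute yields an empty string, while a missing attribute yields null, so the two cases can be told apart.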

But that is still going to fail in many cases, like any attempt of this kind to process HTML with regex.


These are some examples of cases that would make this pattern, or any similar one, fail:

  • You are using .*? for the content of <option/> . What happens if the content occupies more than one line?

  • What if this were part of the JavaScript code on your page?

    <script>
        var rompeRegex = "<option>que no es parte de la página</option>";
    </script>
    
  • What if there were a comment like this:

    <option> Texto <!-- y este es el último </option> --> de la opción </option>
    
  • And the list goes on, and on, and on.
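
The failure cases listed above can be checked directly. This is a small sketch that runs the pattern with [^"]* against three inputs modeled on the examples in the list; the variable names are chosen here for illustration only.

```javascript
const fragilePattern = /<option\s*(?:value\s*=\s*"([^"]*)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>/;

// 1. Label spread over several lines: "." does not match line breaks, so nothing matches.
const multilineOption = "<option>\n  opcion 1\n</option>";
console.log(fragilePattern.exec(multilineOption)); // null

// 2. A closing tag inside a comment ends the match too early.
const commentedOption = "<option> Texto <!-- y este es el último </option> --> de la opción </option>";
console.log(fragilePattern.exec(commentedOption)[3]);

// 3. Markup inside a script string is matched as if it were a real option.
const scriptSource = '<script>var rompeRegex = "<option>que no es parte de la página</option>";</script>';
console.log(fragilePattern.exec(scriptSource)[3]);
```

In the second case the captured label stops at the </option> inside the comment, and in the third the pattern reports an option that is not part of the page at all.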


The effective solution: you are programming in JavaScript, a language designed specifically to work with HTML. Use the tools the language gives you and process the HTML with the DOM, as it should be done.

There are 2 main options within JavaScript:

  • Simple but insecure.

    You can load the HTML inside the current page (in a hidden div, for example) and process it as if it were part of your own page.

  • Use DOMImplementation .

    In this way, you would create a document model that is not associated with your page.
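
The second option might look roughly like the sketch below. It assumes a browser environment where `document` is available; `readOptions` is a hypothetical helper name, and the optional `doc` parameter exists only so the function fails with a clear message when no DOM is present.

```javascript
// Hypothetical helper: parses `html` in a detached document and reads each <option>.
// Assumes a browser-like environment that provides `document`.
function readOptions(html, doc = globalThis.document) {
  if (!doc) {
    throw new Error("readOptions needs a DOM (run it in a browser)");
  }
  // A new document that is not attached to the current page.
  const detached = doc.implementation.createHTMLDocument("");
  detached.body.innerHTML = html;

  return Array.from(detached.querySelectorAll("option"), (option) => ({
    value: option.getAttribute("value"),       // null when the attribute is absent
    selected: option.hasAttribute("selected"), // true when the attribute is present
    label: option.textContent,
  }));
}
```

In a browser console, calling `readOptions` with the third option from the question should give one record with an empty value and `selected` set to true, with no regular expression involved.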

As you commented afterwards, the goal is to use this from Apex. My answer remains the same: use a Document Object Model (DOM).

  • On Apex: see Dom Namespace .

answered by 15.11.2017 / 18:14

    1

    Change your regular expression to the following:

    <option\s?(value="([^"]+?)?")?\s?(selected="selected")?>(.+?)<\/option>

    The [^"] part of the expression matches any character except the double quote character " .

    The ^ character, which normally denotes the beginning of a string, is a negation when it is the first character inside brackets, as in [^...] .

    Here is the link with the test:

    link
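
The two meanings of ^ can be compared in a few lines. This is a minimal sketch; the variable names are illustrative, and the last part applies this answer's pattern to the option that failed in the question.

```javascript
// Outside brackets, ^ anchors the match at the start of the string.
console.log(/^a/.test("ab")); // true
console.log(/^a/.test("ba")); // false

// As the first character inside brackets, ^ negates the class:
// [^"] matches any single character except the double quote.
console.log(/[^"]/.test('"')); // false
console.log(/[^"]/.test("x")); // true

// The pattern from this answer, applied to the third option from the question.
const negatedPattern = /<option\s?(value="([^"]+?)?")?\s?(selected="selected")?>(.+?)<\/option>/;
const negatedMatch = negatedPattern.exec('<option value="" selected="selected">--Ninguno--</option>');
console.log(negatedMatch[2]); // value group
console.log(negatedMatch[3]); // selected group
```

Here the value group is undefined rather than an empty string, because + still requires at least one character, and the selected group now captures selected="selected".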

        
    answered by 15.11.2017 at 18:04
    -2

    Except in specific cases, trying to use regex to parse HTML is a contradiction in terms.

    See the best answer on the subject, which is legendary on Stack Overflow.

      

    HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can not possibly make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

    There is even a meme that circulates on the internet

        
    answered by 15.11.2017 at 21:11