That's the problem with trying to process HTML with regex. There is always an exception that will break your pattern, because you cannot realistically fit all of HTML's syntax into a single regex (in theory perhaps, but not within reason).
The error happens because .+?, while trying to consume as few characters as possible, must still match at least 1 character.
<option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option>
                   ^^^
                   this consumes the text: " selected="selected
So, when it encounters the text
<option value="" selected="selected"
               |
               |
               The match attempt's cursor starts here
               .+? first tries to match 1 character (the closing quote),
               but the rest of the pattern does not match.
               So it then tries 2 characters, then 3, then 4, and so on.
Until it finally finds a match by consuming:
<option value="" selected="selected">--Ninguno--</option>
               \__________________/
                       .+?
               matches this text
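The capture described above can be checked directly. This is a minimal sketch that runs the original pattern against the sample markup from this answer; the variable names are only illustrative.

```javascript
// The original pattern, with .+? inside the optional value group.
const pattern = /<option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option>/;
const html = '<option value="" selected="selected">--Ninguno--</option>';

const match = pattern.exec(html);

// Group 2 was meant to hold the value, group 3 the selected attribute.
console.log(JSON.stringify(match[2])); // "\" selected=\"selected"
console.log(match[3]);                 // undefined
console.log(match[4]);                 // --Ninguno--
```

Group 2 swallows the selected attribute, so group 3 never gets a chance to match it.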
You might think that the outer ? makes it optional, but that quantifier is greedy, so it tries to match one repetition first before trying zero.
<option\s?(value="(.+?)?")?\s?(selected="selected")?>(.+?)<\/option>
                  ^^^^^^
                  this quantifier tries to match 1 repetition first
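To see that order of attempts in isolation, here is a small sketch on a cut-down input. The lazy form ?? is not part of the original pattern; I include it only for contrast, since it tries zero repetitions first.

```javascript
const text = '"" x="y"';

// Greedy optional group: tries one repetition of (.+?) before trying zero.
const greedy = /"(.+?)?"/.exec(text);
console.log(greedy[0]); // "" x="
console.log(greedy[1]); // " x=

// Lazy optional group (contrast only): tries zero repetitions first.
const lazy = /"(.+?)??"/.exec(text);
console.log(lazy[0]); // ""
console.log(lazy[1]); // undefined
```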
The simple solution is to change the quantifier + to *, leaving:
<option\s?(value="(.*?)")?\s?(selected="selected")?>(.*?)<\/option>
You could even simplify the expression a bit, dropping the groups you probably don't use:
<option\s*(?:value\s*=\s*"(.*?)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>
Or, without allowing quotation marks inside the quoted value:
<option\s*(?:value\s*=\s*"([^"]*)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>
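As a quick check, the three corrected patterns can be run against the same sample markup. This is only a sketch; the keys of the `fixes` object are names I made up for illustration. Note that the group numbers shift in the simplified versions, because the value group is no longer wrapped in a capturing group.

```javascript
const html = '<option value="" selected="selected">--Ninguno--</option>';

const fixes = {
  starQuantifier: /<option\s?(value="(.*?)")?\s?(selected="selected")?>(.*?)<\/option>/,
  simplified: /<option\s*(?:value\s*=\s*"(.*?)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>/,
  noInnerQuotes: /<option\s*(?:value\s*=\s*"([^"]*)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>/,
};

// Print every captured group for each pattern.
for (const [name, re] of Object.entries(fixes)) {
  const groups = re.exec(html).slice(1);
  console.log(name, JSON.stringify(groups));
}
```

In all three, the value is captured as an empty string and the selected attribute lands in its own group.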
But that will still fail in many cases, like any attempt to process HTML with regex this way.
Here are some examples of cases that would break this pattern, or any similar one:
- You are using .*? for the content of <option/>. What happens if the content spans more than one line?
- What if this were part of the JavaScript code on your page?
<script>
var rompeRegex = "<option>que no es parte de la página</option>";
</script>
- What if there were a comment like this:
<option> Texto <!-- y este es el último </option> --> de la opción </option>
- And the list goes on, and on, and on.
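The three cases listed above can be reproduced with the last corrected pattern. This is a rough sketch; `optionText` is a hypothetical helper written just for this demonstration.

```javascript
const optionRe = /<option\s*(?:value\s*=\s*"([^"]*)"\s*)?(selected\s*=\s*"selected"\s*)?>(.*?)<\/option>/;

// Returns the captured option text, or null when the pattern finds nothing.
function optionText(html) {
  const m = optionRe.exec(html);
  return m ? m[3] : null;
}

// 1. Content spanning two lines: "." does not match a line break by default.
console.log(optionText("<option>line one\nline two</option>")); // null

// 2. Markup inside a script string is matched as if it were a real element.
const script = '<script>\nvar rompeRegex = "<option>que no es parte de la página</option>";\n</script>';
console.log(optionText(script)); // que no es parte de la página

// 3. A closing tag inside a comment ends the match too early.
const commented = "<option> Texto <!-- y este es el último </option> --> de la opción </option>";
console.log(JSON.stringify(optionText(commented))); // " Texto <!-- y este es el último "
```

Each case can be patched on its own, but every patch makes the pattern longer and leaves the next edge case open.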
The effective solution: you are programming in JavaScript, a language designed specifically to work with HTML. Use the tools the language gives you and process the HTML with the DOM, as it should be done.
There are 2 main options within JavaScript:
- Simple but insecure. You can load the HTML inside the current page (in a hidden div, for example) and process it as if it were part of your own page.
- Use DOMImplementation. This way, you create a document model that is not associated with your page.
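A minimal sketch of the second option, assuming the code runs in a browser. The function names `readOptions` and `parseOptions` are hypothetical; `document.implementation.createHTMLDocument` is the standard DOMImplementation method. I read the attributes with `getAttribute` and `hasAttribute` so the result mirrors what the regex was trying to capture.

```javascript
// Collects value, selected attribute and text from every <option> in a document.
// `doc` is anything exposing querySelectorAll, normally a detached Document.
function readOptions(doc) {
  return Array.from(doc.querySelectorAll("option"), (option) => ({
    value: option.getAttribute("value"), // null when the attribute is absent
    selected: option.hasAttribute("selected"),
    text: option.textContent,
  }));
}

// Parses an HTML string in a document that is not attached to the page.
function parseOptions(html) {
  const doc = document.implementation.createHTMLDocument("");
  doc.body.innerHTML = html;
  return readOptions(doc);
}

// Browser-only usage; skipped where no global `document` exists (e.g. Node).
if (typeof document !== "undefined") {
  console.log(parseOptions('<select><option value="" selected="selected">--Ninguno--</option></select>'));
}
```

Because the browser's own parser does the work, multi-line content, script strings and comments are handled as HTML defines them, with no pattern to maintain.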
Since you later commented that the goal is to use this with Apex, my answer remains the same: use a Document Object Model (DOM).
In Apex: see the Dom namespace.