Is there a reference to know all the characters that should be escaped in regular expressions?

4

In some Java code I need to remove a series of characters from a string.

This is the code I'm using:

html=html.replaceAll("[«»\"\'“”¿?]","");

If I try to escape the question mark I get an error. Shouldn't the question mark be escaped?

I have two questions:

  • Is there a reference (a list) I can consult to know which characters must be escaped in a regex?
  • In a regex, is the escape character always \, or are there other escape characters?
asked by A. Cedano 13.12.2018 at 13:53

2 answers

2

Any character that has a special meaning within a regular expression (such as . which matches any character, * for zero or more repetitions, + for one or more, and so on) must be "escaped" (preceded by a \) if you want it to lose its special meaning and be treated as a literal character. For example, the regular expression \*\+ matches an asterisk followed by a plus sign.

In particular, the special characters are:

[\^$.|?*+(){

In some contexts it is not necessary to escape them, for example inside brackets (a character class), but in general it does no harm to escape them there too, and that way you do not have to remember the exceptions.
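As a rough illustration of the point about brackets, the following Java sketch removes question marks with and without the escape and gets the same result. The class and method names are made up for this example, and the doubled backslash in the second literal is only there because the regex sits inside a Java string.

```java
final class BracketEscapeDemo {
    // Inside a character class, ? is already literal, so no escape is needed.
    static String stripUnescaped(String text) {
        return text.replaceAll("[?]", "");
    }

    // Escaping it anyway is harmless. The regex is [\?]; the backslash is
    // doubled only because it is written inside a Java string literal.
    static String stripEscaped(String text) {
        return text.replaceAll("[\\?]", "");
    }

    public static void main(String[] args) {
        String input = "Why? How?";
        System.out.println(stripUnescaped(input)); // prints Why How
        System.out.println(stripEscaped(input));   // prints Why How
    }
}
```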

Another important thing, which may be what is tripping you up, is that in most languages (Java among them) the \ character is also used to write special characters inside a string literal, such as \n. Among other things, it is also used to escape the " character, which would otherwise be taken as the end of the string.

Therefore, if a \ appears inside a string literal, Java (or C, or Python) expects it to be followed by a character such as n (among others) and gives the pair a special meaning. To avoid that, you have to escape the \ itself by writing \\ .

If you are using \ to escape a special character within a regular expression, as in the \*\+ we saw earlier, and that regular expression also goes inside a quoted string, you must escape each \ it contains and write "\\*\\+" . Otherwise Java (or C) would try to give a special meaning to the character that follows the \ , as happens with \n .

That brings us to the almost absurd situation that, if you want a regular expression that matches a \ , you have to escape it inside the regular expression and write \\ , and then, when you put that in a quoted Java string, you have to escape both of them and write "\\\\" .
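A minimal Java sketch of that double layer, using the asterisk-plus regex from above. The class, constant, and method names are hypothetical; the only library calls are `Pattern.compile` and `Matcher.find`.

```java
import java.util.regex.Pattern;

final class StarPlusDemo {
    // Four characters reach the regex engine: backslash, *, backslash, +.
    // Each backslash is doubled because this is a Java string literal.
    static final String REGEX_SOURCE = "\\*\\+";
    static final Pattern STAR_PLUS = Pattern.compile(REGEX_SOURCE);

    // True if the text contains a literal asterisk followed by a plus sign.
    static boolean containsStarPlus(String text) {
        return STAR_PLUS.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(REGEX_SOURCE.length());    // prints 4
        System.out.println(containsStarPlus("a*+b")); // prints true
        System.out.println(containsStarPlus("a+b"));  // prints false
    }
}
```

Printing the length is a quick way to check what the regex engine really receives, as opposed to what was typed in the source file.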
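The backslash case can be sketched in Java as follows. The path-conversion helper is an invented example, not something from the question; it only shows that four backslashes in the source end up matching a single backslash in the input.

```java
final class BackslashDemo {
    // Source literal with four backslashes -> two-character regex -> matches one backslash.
    static final String ONE_BACKSLASH_REGEX = "\\\\";

    // Replaces every backslash in the input with a forward slash.
    static String toForwardSlashes(String path) {
        return path.replaceAll(ONE_BACKSLASH_REGEX, "/");
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\file.txt"; // the string holds two single backslashes
        System.out.println(ONE_BACKSLASH_REGEX.length()); // prints 2
        System.out.println(toForwardSlashes(path));       // prints C:/temp/file.txt
    }
}
```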

In your case

Given all of the above, in the regular expression from your question the only character you might need to escape is ? , and therefore the regular expression you need is:

[«»"'“”¿\?]

Because the question mark is inside brackets you would not actually need to escape it but, as I said before, nothing bad happens if you do.

But then, when you put the above in a Java string, you run into two problems:

  • A " appears in the expression, which for Java would be the end of the string. You have to escape it (for Java, not for the regular expression).
  • A \ also appears in front of the question mark, and you have to escape it too (again, for Java). Written with a single backslash, \? is not a valid escape sequence in a Java string literal, which is why you get an error.

So you should put:

html=html.replaceAll("[«»\"'“”¿\\?]","");

Notice that the string the function actually receives is [«»"'“”¿\?] and not [«»\"'“”¿\\?] (in the same way that "\n" stores a single character, the newline, and not the two characters \ and n ).
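To see the difference between what is typed and what the function receives, here is a reduced Java sketch. It deliberately keeps only the ASCII characters of the class (the double quote, the apostrophe and the question mark) so that it does not depend on the source file encoding; the class and method names are assumptions for illustration.

```java
final class QuoteStripDemo {
    // Eight characters are typed between the quotes, but the string that
    // reaches the regex engine has six: [ " ' backslash ? ]
    static final String REGEX_SOURCE = "[\"'\\?]";

    // Removes double quotes, apostrophes and question marks.
    static String strip(String text) {
        return text.replaceAll(REGEX_SOURCE, "");
    }

    public static void main(String[] args) {
        System.out.println(REGEX_SOURCE.length()); // prints 6
        System.out.println(strip("He said \"why?\" and 'how?'")); // prints He said why and how
    }
}
```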

    
answered by 13.12.2018 at 16:24
0

There are no characters that must always be escaped; you escape the special characters that you want to be taken literally. In your case, the ? is a special character. And yes, \ is the escape character.

    
answered by 13.12.2018 at 15:28