Problem with accents in Regex

2

I am using the website (link), with the following as the regular expression:

(?i)purificación

And the following as the text:

PURIFICACIÓN

To my surprise they do not match. I thought the case-insensitive modifier (i) would take care of this, but it does not. Any ideas?

PS: Unfortunately, running a replaceAll on the text to change Ó to O is not a valid option for me.

    
asked by Daniel Faro 10.08.2016 at 11:35

2 answers

5

By default, (?i) only applies case-insensitive matching to characters in the US-ASCII character set. However, you can enable Unicode-aware case-insensitive matching in either of the following two ways:

  • Embedded flags

    Add the u flag to enable case-insensitive matching in Unicode. That is:

    public class Main {
        public static void main(String[] args) {
            String str = "PURIFICACIÓN";
            // (?i) ignores case, (?u) extends that to Unicode letters such as Ó
            System.out.println(
                str.matches("(?iu)purificación")
            ); // prints "true"
        }
    }
    
  • Bit Mask

    Another way is to pass the corresponding flags as a bit mask to the compile method of the Pattern class (java.util.regex.Pattern, http://bit.ly/2biRBTL). That is:

    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            String input = "PURIFICACIÓN";
            // Combine the flags with a bitwise OR
            Pattern regexPattern = Pattern.compile("purificación",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            System.out.println(
                regexPattern.matcher(input).matches()
            );  // prints "true"
        }
    }
    

    The bit mask can include CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ, UNIX_LINES, LITERAL, UNICODE_CHARACTER_CLASS and COMMENTS.
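To see the difference between the two behaviours side by side, here is a minimal sketch. The class name UnicodeCaseDemo and the helpers asciiOnly and unicodeAware are made up for this illustration, and the accented letters are written as Unicode escapes only so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    // "PURIFICACIÓN" and "purificación" written with Unicode escapes
    // (\u00D3 is Ó, \u00F3 is ó) to avoid source-encoding surprises.
    static final String UPPER = "PURIFICACI\u00D3N";
    static final String LOWER = "purificaci\u00F3n";

    // Hypothetical helper: case-insensitive matching with CASE_INSENSITIVE only,
    // which folds case for US-ASCII letters but not for Ó/ó.
    static boolean asciiOnly(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    // Hypothetical helper: same match with UNICODE_CASE added to the bit mask.
    static boolean unicodeAware(String pattern, String input) {
        return Pattern.compile(pattern,
                        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly(LOWER, UPPER));        // false
        System.out.println(unicodeAware(LOWER, UPPER));     // true
        System.out.println(UPPER.matches("(?iu)" + LOWER)); // true (embedded flags)
    }
}
```

The first call reproduces the behaviour described in the question; the other two are the bit-mask and embedded-flag forms of the fix.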

Summary

The following table lists the flags you can use, along with the corresponding embedded flag where one exists:

+-------------------------+---------------+
| Pattern constant        | Embedded flag |
+-------------------------+---------------+
| UNIX_LINES              | (?d)          |
| CASE_INSENSITIVE        | (?i)          |
| COMMENTS                | (?x)          |
| MULTILINE               | (?m)          |
| LITERAL                 | (none)        |
| DOTALL                  | (?s)          |
| UNICODE_CASE            | (?u)          |
| CANON_EQ                | (none)        |
| UNICODE_CHARACTER_CLASS | (?U)          |
+-------------------------+---------------+
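Flags without an embedded form, such as LITERAL, can only be passed through the bit mask. As a rough sketch (the class LiteralFlagDemo, the helper literalIgnoreCase and the sample strings are invented for illustration), LITERAL can be combined with the two case flags so that the text is matched verbatim but still ignoring case:

```java
import java.util.regex.Pattern;

public class LiteralFlagDemo {
    // Hypothetical helper: treat the text as a literal (no metacharacters)
    // while ignoring case with Unicode-aware folding.
    static Pattern literalIgnoreCase(String text) {
        return Pattern.compile(text,
                Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    public static void main(String[] args) {
        // "purificación (1.0)": the parentheses and the dot are matched literally.
        Pattern p = literalIgnoreCase("purificaci\u00F3n (1.0)");
        System.out.println(p.matcher("PURIFICACI\u00D3N (1.0)").matches()); // true
        System.out.println(p.matcher("PURIFICACI\u00D3N 1x0").matches());   // false
    }
}
```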
    
answered 10.08.2016 at 15:36

0

The solution is to add the u modifier to the regular expression. This modifier tells the engine to treat the pattern as a UTF-16 string rather than ASCII.

I attached a screenshot of the change in regex101. The modifiers mean:

  • i: case insensitive
  • u: treat the pattern as UTF-16
  • m: multi-line mode (^ and $ match at the start and end of each line)
  • g: global matching, returning all the matches

Finally, I recommend applying i as a modifier of the whole pattern rather than inline, as you did:

/purificación/imug <- matches the WHOLE pattern case-insensitively
/(?i)purificación/imug <- matches case-insensitively only FROM the point where you place it

For example, /a(?i)purificación/mug matches 'apurificación', 'aPURIFICACIÓN', 'aPUrifICAciÓn', etc., but always with the first 'a' in lowercase.

In contrast, /apurificación/imug matches 'apurificación', 'Apurificación', 'apuRiFICación', etc. It does not matter whether the 'a' is uppercase or lowercase.
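The same scoping applies to embedded flags in Java, the flavor used in the other answer. The sketch below assumes that flavor; the class InlineFlagScopeDemo and its members are hypothetical names, and the accented letters are written as Unicode escapes to avoid source-encoding issues.

```java
import java.util.regex.Pattern;

public class InlineFlagScopeDemo {
    // Flags placed after the leading 'a': that 'a' stays case-sensitive.
    static final Pattern PARTIAL = Pattern.compile("a(?iu)purificaci\u00F3n");
    // Flags placed at the start: the whole pattern ignores case.
    static final Pattern WHOLE = Pattern.compile("(?iu)apurificaci\u00F3n");

    // Hypothetical helper: true when the entire string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(PARTIAL, "aPURIFICACI\u00D3N")); // true
        System.out.println(matches(PARTIAL, "APURIFICACI\u00D3N")); // false
        System.out.println(matches(WHOLE, "APURIFICACI\u00D3N"));   // true
    }
}
```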

    
answered 10.08.2016 at 12:21