Problem with accents in Regex

Question

I am using the website link. I use the following phrase as the regular expression:

(?i)purificación

And the following as the text:

PURIFICACIÓN

To my surprise they do not match. I thought the case-insensitive modifier (i) would solve this, but it does not. Any ideas?

PS: Unfortunately, calling replaceAll on the text to change Ó to O is not a valid option for me.

java regex

asked by Daniel Faro 10.08.2016 at 09:35

2 answers

score 0

The solution is to add the u modifier to the regular expression. This modifier indicates that the pattern is treated as a UTF-16 string rather than ASCII.

I attached a screenshot of the change in regex101. Each modifier means:

i: case insensitive
u: treat the pattern as UTF-16
m: multi-line matching
g: global matching, returning all matches

Finally, I recommend applying i as a pattern-level modifier rather than inline, as you did:

/purificación/imug <- matches the WHOLE pattern case-insensitively
/(?i)purificación/imug <- matches case-insensitively only **from the point** where you place it

For example, /a(?i)purificación/mug matches 'apurificación', 'aPURIFICACIÓN', 'aPUrifICAciÓn', etc., but always with the first 'a' in lowercase.

In contrast, /apurificación/imug matches 'apurificación', 'Apurificación', 'apuRiFICación', etc. It does not matter whether the 'a' is uppercase or lowercase.
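Since the question is tagged java, the same scoping behaviour can be sketched there. This is only an illustration under a few assumptions: the class and method names are made up, the inline flag is written as (?iu) so that it also covers the accented letter, and the accented characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class InlineFlagScopeDemo {
    // "purificación", spelled with a Unicode escape for the accented o
    static final String WORD = "purificaci\u00f3n";

    // Inline flag placed after the leading 'a': only the rest of the pattern ignores case
    static final Pattern SCOPED = Pattern.compile("a(?iu)" + WORD);

    // Flags passed to compile(): the whole pattern ignores case
    static final Pattern GLOBAL = Pattern.compile("a" + WORD,
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean matchesScoped(String s) {
        return SCOPED.matcher(s).matches();
    }

    static boolean matchesGlobal(String s) {
        return GLOBAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String lowerA = "aPURIFICACI\u00d3N"; // lowercase first letter
        String upperA = "APURIFICACI\u00d3N"; // uppercase first letter

        System.out.println(matchesScoped(lowerA)); // true
        System.out.println(matchesScoped(upperA)); // false: the leading 'a' is still case-sensitive
        System.out.println(matchesGlobal(lowerA)); // true
        System.out.println(matchesGlobal(upperA)); // true
    }
}
```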

answered 10.08.2016 at 10:21

score 5 · Accepted Answer

By default, (?i) only applies to characters in the US-ASCII character set. However, you can enable Unicode-aware case-insensitive matching in either of the following two ways:

Embedded flags

Add the u flag next to i to enable Unicode-aware case-insensitive matching. That is:

public class Main {
    public static void main(String[] args) {
        String str = "PURIFICACIÓN";
        // (?iu) = case-insensitive matching with Unicode case folding
        System.out.println(
            str.matches("(?iu)purificación")
        ); // prints "true"
    }
}
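To make the effect of the u flag visible, here is a minimal sketch that runs the same text against both variants of the pattern. The class and method names are only illustrative, and the accented letters are written as Unicode escapes so the sketch does not depend on the source file encoding.

```java
class UnicodeCaseFlagDemo {
    // "PURIFICACIÓN", spelled with a Unicode escape for the accented O
    static final String TEXT = "PURIFICACI\u00d3N";

    // (?i) alone: case-insensitive only for US-ASCII letters
    static boolean asciiOnly(String text) {
        return text.matches("(?i)purificaci\u00f3n");
    }

    // (?iu): case-insensitive with Unicode-aware case folding
    static boolean unicodeAware(String text) {
        return text.matches("(?iu)purificaci\u00f3n");
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly(TEXT));    // false: the accented letter differs in case
        System.out.println(unicodeAware(TEXT)); // true
    }
}
```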

Bit Mask

Another way is to pass the corresponding flags as a bit mask to the compile method of the Pattern class (java.util.regex.Pattern, http://bit.ly/2biRBTL). That is:
```
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String input = "PURIFICACIÓN";
        // Combine both flags so case folding also covers non-ASCII letters
        Pattern regexPattern = Pattern.compile("purificación",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        System.out.println(
            regexPattern.matcher(input).matches()
        );  // prints "true"
    }
}
```
The bit mask can include CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ, UNIX_LINES, LITERAL, UNICODE_CHARACTER_CLASS and COMMENTS.
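As a rough illustration of combining more than two flags, the sketch below adds MULTILINE to the mask so that ^ and $ anchor at every line of the input. The class name, the helper method and the sample text are assumptions made for the example, and the accented letters are written as Unicode escapes so the sketch does not depend on the source file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedFlagsDemo {
    // Flags are combined with the bitwise OR operator
    static final Pattern LINE = Pattern.compile("^purificaci\u00f3n$",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);

    // Counts the lines that consist only of the word, in any letter case
    static int countMatchingLines(String text) {
        Matcher matcher = LINE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "agua\nPURIFICACI\u00d3N\nPurificaci\u00f3n\nfin";
        System.out.println(countMatchingLines(text)); // 2
    }
}
```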

Summary

Here is the list of flags you can use, each with its corresponding embedded flag (if one exists):

+-------------------------+---------------+
| Pattern constant        | Embedded flag |
+-------------------------+---------------+
| UNIX_LINES              | (?d)          |
| CASE_INSENSITIVE        | (?i)          |
| COMMENTS                | (?x)          |
| MULTILINE               | (?m)          |
| LITERAL                 |               |
| DOTALL                  | (?s)          |
| UNICODE_CASE            | (?u)          |
| CANON_EQ                |               |
| UNICODE_CHARACTER_CLASS | (?U)          |
+-------------------------+---------------+
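To illustrate how the two columns relate, the sketch below checks a few rows of the table both ways: once with the constant passed to compile and once with the embedded flag prefixed to the pattern. The class name and the two helper methods are hypothetical, and the accented letter is written as a Unicode escape so the sketch does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class EmbeddedVsConstantDemo {
    // Compiles the regex with a flag constant (or 0 for no flags)
    static boolean viaConstant(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Compiles the regex with an embedded flag such as "(?s)" prefixed to it
    static boolean viaEmbedded(String embeddedFlag, String regex, String input) {
        return Pattern.compile(embeddedFlag + regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // DOTALL and (?s): '.' also matches line terminators
        System.out.println(viaConstant("a.b", 0, "a\nb"));              // false
        System.out.println(viaConstant("a.b", Pattern.DOTALL, "a\nb")); // true
        System.out.println(viaEmbedded("(?s)", "a.b", "a\nb"));         // true

        // UNICODE_CHARACTER_CLASS and (?U): \w also matches accented letters
        String word = "purificaci\u00f3n";
        System.out.println(viaConstant("\\w+", 0, word));                               // false
        System.out.println(viaConstant("\\w+", Pattern.UNICODE_CHARACTER_CLASS, word)); // true
        System.out.println(viaEmbedded("(?U)", "\\w+", word));                          // true

        // LITERAL has no embedded form: the pattern is taken character by character
        System.out.println(viaConstant("a.b", Pattern.LITERAL, "a.b")); // true
        System.out.println(viaConstant("a.b", Pattern.LITERAL, "axb")); // false
    }
}
```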
