Detect duplicate words with Regex

I'm trying to find all the duplicate words in a sentence.

To do this, I am testing with the following code, but it only detects the first duplicate word. I would like to remove all of the repeats, regardless of whether they are uppercase or lowercase.

This is my code:

public static void main(String[] args) {

    String regex = "\\b(\\w+)\\s+\\1\\b+";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE /* Insert the correct Pattern flag here.*/);
    Scanner in = new Scanner(System.in);
    String input = in.nextLine();
    Matcher m = p.matcher(input);
    while (m.find()) {
        input = input.replaceAll(m.group(), m.group(1));
    }

    // Prints the modified sentence.
    System.out.println(input);
}

For example, in the sentence:

Hola hola hOla

It should print only:

Hola

Currently it prints:

Hola hOla
    
asked by pitiklan 04.01.2017 at 11:38

2 answers

Try this:

String regex = "\\b(\\w+)\\b(\\s+\\1)+\\b";

Explanation:

  

\b marks a word boundary. Without it, "Hola OLA" would match: skipping the H leaves "ola OLA", which fits the pattern, but we only want to compare whole words.

\w matches a word character ([a-zA-Z0-9_]) and + allows one or more of them, so \w+ matches a word.

\s+ matches one or more whitespace characters.

\1 refers back to whatever the first group, (\w+), captured, so the next word must be the same as the first.

The final + means that the group (\s+\1) can be repeated one or more times.

So in the end the expression is Word1 followed by (Space + Word1), where (Space + Word1) can repeat n times.
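To make the role of each piece concrete, here is a minimal sketch that applies the pattern above with Matcher.replaceAll and compares it with a variant that drops the \b anchors. The class name DuplicateWordsSketch, the collapse helper and the boundary-free variant are assumptions added for illustration; they are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class DuplicateWordsSketch {

    // The pattern from this answer: a whole word, then one or more repeats of it.
    static final Pattern WITH_BOUNDARIES =
            Pattern.compile("\\b(\\w+)\\b(\\s+\\1)+\\b", Pattern.CASE_INSENSITIVE);

    // Same idea without \b, only to show why the word boundaries matter.
    static final Pattern WITHOUT_BOUNDARIES =
            Pattern.compile("(\\w+)(\\s+\\1)+", Pattern.CASE_INSENSITIVE);

    // Replaces each run of repeated words with the first word of the run ($1).
    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // All three spellings collapse into the first one.
        System.out.println(collapse(WITH_BOUNDARIES, "Hola hola hOla"));  // Hola
        // Different words: nothing is replaced.
        System.out.println(collapse(WITH_BOUNDARIES, "Hola OLA"));        // Hola OLA
        // Without \b, "ola OLA" matches inside the sentence and is collapsed.
        System.out.println(collapse(WITHOUT_BOUNDARIES, "Hola OLA"));     // Hola
    }
}
```

The last call shows the false positive described above: the match starts in the middle of "Hola", so part of a different word is treated as a duplicate.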

    
answered by 04.01.2017 / 13:02

Assuming you are looking for consecutive duplicates separated by spaces, there are two things to fix in the code you are trying:

  • The pattern only looks for one repeated word with \s+\1\b+, since the last + only repeats \b (and is unnecessary). Instead, you should repeat the whole subpattern by grouping it: (?:\s+\1)+\b.

  • The code loops over the matches and, for each one, replaces the matched string in the input, which can lead to replacements in the wrong places. Instead, you should use Matcher.replaceAll() to perform all replacements with a single call. As the replacement, we will use $1 to refer to the text captured by the first group.

    Regex:

    \b(\w+)(?: +\1)+\b
    

    Replacement:

    $1
    

    Text:

    Hola hola hOla
    

    Result:

    Hola
    

    Code:

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            final String regex = "\\b(\\w+)(?: +\\1)+\\b";
            final Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

            // $1 refers to the text captured by the first group.
            final String reemplazo = "$1";

            Scanner in = new Scanner(System.in);
            String input = in.nextLine();
            Matcher m = p.matcher(input);

            // Replace all occurrences in a single call.
            input = m.replaceAll(reemplazo);

            System.out.println(input);
        }
    }
    


    However, the previous solution does not handle accented letters, because by default \w only matches ASCII characters ([a-zA-Z0-9_]). To handle them, you can add the UNICODE_CHARACTER_CLASS flag.

    final Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
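    As a rough illustration of the difference, the sketch below runs the same regex with and without the flag on "canción canción". The class name UnicodeDuplicatesSketch and the collapse helper are assumptions made for this example. The accented letters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class UnicodeDuplicatesSketch {

    static final String REGEX = "\\b(\\w+)(?: +\\1)+\\b";

    // \w only matches ASCII word characters here.
    static final Pattern ASCII_ONLY =
            Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);

    // \w and \b follow Unicode rules; the flag also implies UNICODE_CASE.
    static final Pattern UNICODE_AWARE =
            Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    // Replaces each run of repeated words with the first word of the run ($1).
    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // The same accented word twice (\u00f3 is o with an acute accent).
        String lower = "canci\u00f3n canci\u00f3n";
        // The same word in different case (\u00d3 is the uppercase letter).
        String mixed = "Canci\u00f3n CANCI\u00d3N";

        // Without the flag the accented letter splits the word, so nothing matches.
        System.out.println(collapse(ASCII_ONLY, lower).equals(lower));        // true
        // With the flag the duplicate is collapsed to a single word.
        System.out.println(collapse(UNICODE_AWARE, lower).length());          // 7
        // Case-insensitive matching also covers the accented letter.
        System.out.println(collapse(UNICODE_AWARE, mixed).length());          // 7
    }
}
```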
    
        
    answered by 04.01.2017 at 14:15