optional groups in regular expressions

While answering another question on SO, I was asked how to handle regular expressions that contain optional groups.

For example, if you would like to capture the phone number and favorite number in the following text:

hola mi telefono es 12345678 y mi numero favorito es el 13

I would use an expression like:

telefono[^\d]*(\d+).*numero favorito[^\d]*(\d+)

If both values were optional, I would try something like:

(?:telefono[^\d]*(\d+))?.*(?:numero favorito[^\d]*(\d+))?

But that expression does not work: .* matches everything, so both optional groups end up empty.
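
To make the failure concrete, here is a minimal sketch that runs the expression above against the sample text. The class and method names (GreedyDotDemo, firstMatchGroups) are only illustrative and are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDotDemo {
    // The naive pattern: both groups optional, with ".*" in between.
    static final Pattern NAIVE = Pattern.compile(
        "(?:telefono[^\\d]*(\\d+))?.*(?:numero favorito[^\\d]*(\\d+))?");

    // Returns {group 1, group 2} of the first match; an entry is null
    // when that group did not participate in the match.
    static String[] firstMatchGroups(String text) {
        Matcher m = NAIVE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = firstMatchGroups(
            "hola mi telefono es 12345678 y mi numero favorito es el 13");
        System.out.println("group 1: " + g[0]);
        System.out.println("group 2: " + g[1]);
    }
}
```

The match starts at index 0, where "telefono" is not present, so the first group is skipped; .* then consumes the rest of the line and the second group is skipped as well. Both entries come back as null.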

The only way I have found to describe the characters between the two groups so that the groups keep working is a negative lookahead that excludes every string I use inside the groups:

(?:telefono[^\d]*(\d+))?(?:(?!telefono|numero favorito).)*(?:numero favorito[^\d]*(\d+))?

Although this does capture the optional groups, some matches are still possible in which neither group participates. Besides, something like this would not scale well to many groups.
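
A small sketch of that behavior, under the assumption that the pattern is applied with Matcher#find(). The names TemperedDotDemo, firstMatch and firstUsefulMatch are hypothetical helpers introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedDotDemo {
    // Same pattern as above: the dot is only allowed where neither keyword starts.
    static final Pattern TEMPERED = Pattern.compile(
        "(?:telefono[^\\d]*(\\d+))?"
        + "(?:(?!telefono|numero favorito).)*"
        + "(?:numero favorito[^\\d]*(\\d+))?");

    // Groups of the very first match, whatever it is.
    static String[] firstMatch(String text) {
        Matcher m = TEMPERED.matcher(text);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    // Groups of the first match in which at least one group participated.
    static String[] firstUsefulMatch(String text) {
        Matcher m = TEMPERED.matcher(text);
        while (m.find()) {
            if (m.group(1) != null || m.group(2) != null) {
                return new String[] { m.group(1), m.group(2) };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "hola mi telefono es 12345678 y mi numero favorito es el 13";
        String[] first = firstMatch(text);
        String[] useful = firstUsefulMatch(text);
        System.out.println("first match:  " + first[0] + ", " + first[1]);
        System.out.println("useful match: " + useful[0] + ", " + useful[1]);
    }
}
```

The first match only covers the text before "telefono" and leaves both groups null, so the caller has to keep calling find() and discard the empty results.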

Is there an alternative?

java regex

asked by Klaimmore 17.03.2018 at 02:14

1 answer

score 2 · Accepted Answer

Multiple groups, all optional, differentiating each one in the result

Everything you mention in the question makes sense, and it is a good analysis of the problem. But it can be handled more simply. Instead of trying to match both numbers in a single match, it is more convenient to think of them as independent matches:

telefono\D*(\d+)|numero favorito\D*(\d+)

This way, each match finds one or the other and fills group 1 or group 2 accordingly. We call Matcher#find() for as long as it keeps matching:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "telefono\\D*(\\d+)|numero favorito\\D*(\\d+)";
final String texto = "hola mi telefono es 12345678 y mi numero favorito es el 13";

final Matcher matcher = Pattern.compile(regex).matcher(texto);

while (matcher.find()) {
    if (matcher.group(1) != null) {
        System.out.println("Tel: " + matcher.group(1));
    } else {
        System.out.println("Num: " + matcher.group(2));
    }
}

Anyway, I know that your question is more about theory than practice. If the groups have to appear in that order in the text, and each of them is optional, then the way to capture them is to move the intermediate text (.*) inside the optional part. That is:

^(?:.*telefono\D*(\d+))?(?:.*numero favorito\D*(\d+))?

Recall that regex quantifiers are greedy by default, so each one always tries to match as much as possible. Here that means (?: ... )? tries to match once before it tries to match zero times. This guarantees that the engine scans the whole string looking for a match (for example, for .*telefono\D*(\d+) ) and only skips the optional group if that part cannot match.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "^(?:.*telefono\\D*(\\d+))?(?:.*numero favorito\\D*(\\d+))?";
final String texto = "hola mi telefono es 12345678 y mi numero favorito es el 13";

final Matcher matcher = Pattern.compile(regex).matcher(texto);

if (matcher.find()) { // this check is redundant: the pattern always matches
    if (matcher.group(1) != null) {
        System.out.println("Tel: " + matcher.group(1));
    }
    if (matcher.group(2) != null) {
        System.out.println("Num: " + matcher.group(2));
    }
}
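
To see the optional behavior, the same pattern can be tried on inputs where one or both values are missing. This is only a sketch; OrderedOptionalDemo and extract are hypothetical names, and the shorter sample texts are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedOptionalDemo {
    static final Pattern ORDERED = Pattern.compile(
        "^(?:.*telefono\\D*(\\d+))?(?:.*numero favorito\\D*(\\d+))?");

    // Returns {telefono, numero favorito}; an entry is null when absent.
    static String[] extract(String text) {
        Matcher m = ORDERED.matcher(text);
        if (!m.find()) {
            // Not expected: every part after ^ is optional.
            return new String[2];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "hola mi telefono es 12345678 y mi numero favorito es el 13",
            "mi telefono es 555",
            "mi numero favorito es el 13",
            "hola"
        };
        for (String s : samples) {
            String[] g = extract(s);
            System.out.println(s + " -> tel=" + g[0] + ", num=" + g[1]);
        }
    }
}
```

Note that the order matters: if the favorite number appears before the phone, only group 1 is filled, because the second optional group starts searching after the phone number.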

You must use Pattern.DOTALL if the text between the start of the string and a keyword can contain a line break, since by default . does not match line terminators.
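
A short sketch of the difference, assuming a two-line input. DotAllDemo and phone are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    static final String REGEX =
        "^(?:.*telefono\\D*(\\d+))?(?:.*numero favorito\\D*(\\d+))?";

    // Returns group 1 (the phone number), compiling with or without DOTALL.
    static String phone(String text, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "hola\nmi telefono es 12345678";
        System.out.println("default: " + phone(text, false));
        System.out.println("DOTALL:  " + phone(text, true));
    }
}
```

Without the flag, .* stops at the end of the first line, never reaches "telefono", and the group is skipped; with Pattern.DOTALL the number is captured.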

If they can appear in any order, it is enough to replace the non-capturing groups (?: ... )? with positive lookaheads (?= ... )? .
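
As a sketch of the any-order variant, the pattern below applies that replacement to the ordered expression. It relies on java.util.regex accepting a quantifier on a lookahead; other regex engines may treat this construct differently. AnyOrderDemo and extract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnyOrderDemo {
    // Each lookahead scans from the start of the string, so the order
    // of the two values in the text no longer matters.
    static final Pattern ANY_ORDER = Pattern.compile(
        "^(?=.*telefono\\D*(\\d+))?(?=.*numero favorito\\D*(\\d+))?");

    // Returns {telefono, numero favorito}; an entry is null when absent.
    static String[] extract(String text) {
        Matcher m = ANY_ORDER.matcher(text);
        if (!m.find()) {
            return new String[2];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = extract("mi numero favorito es el 13 y mi telefono es 12345678");
        System.out.println("tel=" + g[0] + ", num=" + g[1]);
    }
}
```

Because lookaheads consume nothing, the overall match is empty; only the captured groups carry information.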

Another way to have multiple optional groups in order is:

(?:telefono\D*(\d+).*?)?(?:numero favorito\D*(\d+).*?)?$

Here, once one of the groups matches, the lazy (non-greedy) .*? consumes as little as possible until the next group can match, while the final $ forces the match to run all the way to the end of the string.
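
A minimal sketch of this last pattern with Matcher#find(), again using hypothetical names (LazyToEndDemo, extract) and made-up shorter inputs for the partial cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyToEndDemo {
    static final Pattern LAZY = Pattern.compile(
        "(?:telefono\\D*(\\d+).*?)?(?:numero favorito\\D*(\\d+).*?)?$");

    // Returns {telefono, numero favorito} from the first match found.
    static String[] extract(String text) {
        Matcher m = LAZY.matcher(text);
        if (!m.find()) {
            return new String[2];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "hola mi telefono es 12345678 y mi numero favorito es el 13",
            "mi telefono es 12345678 y nada mas",
            "mi numero favorito es el 13"
        };
        for (String s : samples) {
            String[] g = extract(s);
            System.out.println(s + " -> tel=" + g[0] + ", num=" + g[1]);
        }
    }
}
```

Since the pattern is not anchored at the start, find() skips forward until it reaches a position from which the rest of the string can be matched up to $, which is the first keyword when one is present.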