Capturing a non-capturing group, or a capturing group around a non-capturing one

3

Today, while answering a question on this site, I stumbled on a very interesting possible solution: I accidentally deleted part of my regex, and what remained still worked, although it did not make sense to me.

Without further ado:

const regex = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/

const strings = [
        'AAAA_BBBB_CCCC_1_15_17'
        ,'AAAA_BBBB_1'
        ,'AAAA_BBBB_15_17'
        ,'AAAA_BBBB_CCCC_1_2'
    ]

strings.forEach(string => {
  const [fullMatch, ...groups] = string.match(regex)
  console.log(groups)
})

As you can see, I captured a non-capturing group using ((?:_\d+)+), and on regex101 it works in every flavor offered there, which so far are:

  • pcre (php)
  • javascript
  • python
  • golang

Note: since not everyone reads all the available information, the important point is that I am getting the behavior of

/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/

using

/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/

which is strange, because without the outer capturing group, the captured group holds only the last repetition that matched:

const regex = /(_\d+)+/g;
const str = '_1_2_3_4_5_6_7';
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // The result can be accessed through the 'm'-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}

I would like someone to explain why using a double capture worked, and what implications (positive or negative) there are in capturing a non-capturing group the way I did.

    
asked by Ruslan López 03.03.2018 at 18:32

2 answers

2
  

I captured a non-capturing group using ((?:_\d+)+), and on regex101 it works in every flavor offered there so far

And it will work in almost any regex dialect.

  • To be exact, all of them except BREs, POSIX EREs and Oracle, since they do not support non-capturing groups: (?: ... ).


  

I'm getting the behavior of

/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/
     

using

/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/

In fact, using the first form would be a mistake: the trailing (?:_\d+)* is redundant and will never match anything, because the preceding construct, (?:_\d+)+, is greedy and has already consumed everything it could, leaving nothing for the last one.

You can verify this with an example by adding one more group around the final (?:_\d+)*.

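const texto = '_123_456_789_0',
      regex = /((?:_\d+)+((?:_\d+)*))/,
      // Destructure the full match and both groups in the same declaration.
      [match, grupo1, grupo2] = regex.exec(texto);

// Template literals need backticks for ${...} to be interpolated.
console.log(`Group 1: "${grupo1}"`);
console.log(`The last '(?:_\\d+)*' matched: "${grupo2}"`);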


  

I would like someone to explain why using a double capture worked

You are not using a double capture. In ((?:_\d+)+), only the outer group captures; (?: ... ) is just a non-capturing group.

A structure such as ((?:_\d+)+) is perfectly normal and frequently used. Think of it this way: it is the same as (\d+), except that what gets repeated inside ((?:_\d+)+) is not a single digit but an underscore followed by digits.

Nesting groups (with or without capture) is as valid as, and practically the same as, using nested loops in your code ... Simple as that.
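
As a minimal sketch of that difference (the sample string is chosen only for illustration), a quantified capturing group keeps just its last repetition, while an outer capturing group wrapped around a repeated non-capturing group keeps the whole run:

```javascript
// Sample input chosen for illustration only.
const sample = '_1_15_17';

// Quantified capturing group: the group is overwritten on every repetition.
const lastOnly = /(_\d+)+/.exec(sample);

// Outer capturing group around a repeated non-capturing group: the whole run.
const wholeRun = /((?:_\d+)+)/.exec(sample);

// Both levels capturing: group 1 is the whole run, group 2 the last repetition.
const both = /((_\d+)+)/.exec(sample);

console.log(lastOnly[1]);      // "_17"
console.log(wholeRun[1]);      // "_1_15_17"
console.log(both[1], both[2]); // "_1_15_17" "_17"
```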


  

What implications (positive or negative) are there in capturing a non-capturing group the way I did?

None, neither positive nor negative. You could not have achieved the same result without nesting a non-capturing group inside a capturing one that way. Again, it is a completely normal structure.

In fact, as a general rule, you should always use non-capturing groups (?: ... ) when you do not need the text they matched. A non-capturing group does not take up unnecessary memory (neither for storing the captured text nor for recording its start and end positions).

  • If you want to go into more detail, a non-capturing group is slightly slower to compile but more efficient at run time. However, the difference is negligible, and people usually prefer to save memory (it is better viewed as a matter of good practice).
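
A small sketch of what "not storing the capture" looks like from JavaScript, assuming an illustrative input string: both patterns match the same text, but only the capturing version adds an extra entry to the result array.

```javascript
// Illustrative input; any run of _<digits> would do.
const input = '_1_2_3';

const withCapture = /(_\d+)+/.exec(input);
const withoutCapture = /(?:_\d+)+/.exec(input);

console.log(withCapture.length);    // 2: the full match plus one group
console.log(withoutCapture.length); // 1: the full match only
console.log(withCapture[0] === withoutCapture[0]); // true: same overall match
```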


As a bonus, one more correction. Using a structure such as:

([a-zA-Z]+_?[a-zA-Z]+?)

is a mistake. You are consecutively repeating 2 constructs that match the same thing. Since the _ is optional, the regex can behave like [a-zA-Z]+[a-zA-Z]+?, and such a construct is the perfect recipe for catastrophic backtracking.

This problem will not cause an error in the cases you are looking at, but with a slightly more complicated regex, longer texts and an input that does not match, it could cause the browser to freeze without returning a result.

Let's see a test, not so drastic, but obvious enough:

const regex = /^([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)$/,
      N = 1000,
      texto = 'X_'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
            + '_1_2_ERROR';

// Your regex
let a, b, resultado;

a = performance.now();
for (let i = 0; i < N; i++) {
    resultado = regex.exec(texto);
}
b = performance.now();

console.log('"([a-zA-Z]+_?[a-zA-Z]+?)" took:', (b - a), 'ms and returned:', resultado);


// With a nested non-capturing group
const regexConGrupo = /^([a-zA-Z]+)_([a-zA-Z]+(?:_[a-zA-Z]+)?)((?:_\d+)+)$/;
a = performance.now();
for (let i = 0; i < N; i++) {
    resultado = regexConGrupo.exec(texto);
}
b = performance.now();

console.log('"([a-zA-Z]+(?:_[a-zA-Z]+)?)" took:', (b - a), 'ms and returned:', resultado);

And if this were part of a more complicated regex, it could cause you serious problems.

Also, by using ([a-zA-Z]+_?[a-zA-Z]+?), you are requiring that part to have at least 2 letters, so it would not match something like A_B_1.
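
A quick check of that last point, as a sketch that reuses the two patterns from this answer without the ^ and $ anchors (the variable names are only illustrative):

```javascript
// Pattern from the question: the second group needs at least two letters.
const original = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/;

// Variant with a nested optional non-capturing group, as in the timing test.
const adjusted = /([a-zA-Z]+)_([a-zA-Z]+(?:_[a-zA-Z]+)?)((?:_\d+)+)/;

const shortName = 'A_B_1';

console.log(original.test(shortName));          // false
console.log(adjusted.exec(shortName).slice(1)); // ["A", "B", "_1"]

// The longer names from the question still split the same way.
console.log(adjusted.exec('AAAA_BBBB_CCCC_1_15_17').slice(1));
// ["AAAA", "BBBB_CCCC", "_1_15_17"]
```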

    
answered by 06.03.2018 / 01:16
-1

The truth is that it has no implications. A non-capturing group simply groups an expression for convenience, without its match being returned as a group. That does not mean it cannot be part of another group.

Considering the following example:

"cababaabc".match(/c(a|b)*c/).slice(1) // => ["b"]

I do not get the whole run of a's and b's, but a group I may not care about: the last a or b matched by the expression a|b.

If I use a non-capture group:

"cababaabc".match(/c(?:a|b)*c/).slice(1) // => []

I do not get any group.

But if I want the complete string between the two c's, I have to add a group that fully encloses the expression of interest, including the *:

"cababaabc".match(/c((?:a|b)*)c/).slice(1) // => ["ababaab"]

This returns the whole run of a's and b's.

EDIT:

If what you are interested in is comparing your 2 expressions:

/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/

and

/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/

Let me tell you that they are completely equivalent:

The last group in both:

((?:_\d+)+(?:_\d+)*)
((?:_\d+)+)

It's the same as:

((?:A)+(?:A)*)
((?:A)+)

With A = _\d+ and in the first:

(?:A)+(?:A)* is equivalent to A+A*, which is the same as A+: anything the trailing A* could match has already been consumed by the greedy A+.
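
The equivalence can also be checked empirically. This is only a sketch over the four sample strings from the question, not a proof; the names longForm, shortForm and sameResults are illustrative:

```javascript
const longForm = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/;
const shortForm = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/;

// Sample strings taken from the question.
const strings = [
  'AAAA_BBBB_CCCC_1_15_17',
  'AAAA_BBBB_1',
  'AAAA_BBBB_15_17',
  'AAAA_BBBB_CCCC_1_2',
];

// Compare the full match and every group produced by both patterns.
const sameResults = strings.every(
  (s) => JSON.stringify(longForm.exec(s)) === JSON.stringify(shortForm.exec(s))
);

console.log(sameResults); // true
```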

Notice that you are not even capturing the non-capturing group itself, but a different expression:

((?:A)+) here the + quantifier makes it a different expression. Even if it were the same expression, nothing prevents you from capturing it:

((A)) is as valid as ((?:A))

    
answered by 03.03.2018 at 19:47