Capture each section that contains a given name separately, without matching into the next section


Text: I have the following text (where the matches will be made):

Header 1
Codigo : c001
Nombre : Juan
Total  : 45,78



Header 1
c001
Nombre : Juan
Cantidad : 23
Subtotal : 45.89

Total     : 3410.67

Header 1
Codigo : c002
Nombre : Ana
Total  : 45.89    
Header 1
c001
Nombre : Juan
Cantidad:4

Objective: As you can see, Juan has 3 sections (the first two and the last). I want to capture each of Juan's sections separately, identifying them by his code, c001 (some sections may not contain the word Codigo).

Code: This is what I tried:

/(Header1.*?c001.*?)Header1/ism

It starts at the words Header 1 and, since no text marks the end of a section, I use the next Header 1 as the delimiter.

Problem: it does not match all sections, and some matches span more than one section.

Questions:

  • What is the best regular expression to capture each of Juan's sections, assuming that each section is variable in content and that the Header 1 lines mark the beginning and end of a section?

  • What if I want to identify them only by their code, c001?

asked by Walter Zamalloa Perez 22.03.2018 at 04:32

    1 answer


    What would be the regular expression to capture each section of "Juan", assuming that each section is variable and the only things available to identify them are the code and the "Header 1" headings?

    To match the lines between the Header and the name, consume every line that follows the header, as long as that line does not start with a Header:

    ^Header 1(?:\R(?!Header ).*+)*?
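
As a quick check of this building block, here is a minimal sketch. It uses the greedy form of the group, because the lazy `*?` on its own (with nothing after it yet) would stop right after the header line. The sample text is made up for illustration and assumes `\n` line endings.

```php
<?php
// Made-up sample: two sections, the first one contains a blank line.
$texto = "Header 1\nCodigo : c001\n\nNombre : Juan\nHeader 1\nCodigo : c002";

// Greedy variant of the group: after the header, take every following
// line whose line break is NOT followed by "Header ".
$regex = '/^Header 1(?:\R(?!Header ).*+)*/m';

preg_match_all($regex, $texto, $resultado);

// Each match is one whole section and stops before the next header.
foreach ($resultado[0] as $seccion) {
    echo "[", $seccion, "]\n";
}
```

Each match ends just before the line break that precedes the next header, so sections never run into each other.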
    


    Then match the name line itself, followed by all the remaining lines that belong to the same section (demo on regex101):

    /^Header 1$(?:\R(?!Header ).*+)*?\RNombre : Juan$(?:\R(?!Header ).*+)*/mi
    


    Logic

    Subpattern               Description
    ^Header 1$               A whole line that matches "Header 1"
                             (the /m modifier makes ^ and $ match at line start/end)
    (?: )*?                  A group that repeats its subpattern zero or more times (lazily):
    \R(?!Header ).*+           a line break that is not followed by "Header ",
                               then the whole line
    \RNombre : Juan$         A whole line that matches the name being searched for
    (?:\R(?!Header ).*+)*    More lines that do not start with "Header "

    The key point is that after every line break \R we use a negative lookahead to make sure it is not followed by the start of a new section:
    \R(?!Header ) .

    A lookahead tests for a match but only reports success or failure, without advancing the current position. A negative lookahead (?! ... ) succeeds only when the current position is not followed by the pattern inside it. PHP's documentation calls them, rather badly, "statements".
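
To make the "does not advance the position" part concrete, the following sketch (with a made-up three-line sample) matches only the line breaks that are not followed by "Header ". The matched text is just the line break itself; what the lookahead inspected is not part of the match.

```php
<?php
// Made-up sample: the first line break is followed by "Header ", the second is not.
$texto = "uno\nHeader 2\ndos";

// PREG_OFFSET_CAPTURE also returns the byte offset of each match.
preg_match_all('/\R(?!Header )/', $texto, $m, PREG_OFFSET_CAPTURE);

foreach ($m[0] as [$coincidencia, $posicion]) {
    // Only the line break is consumed, never the text checked by the lookahead.
    echo "line break at offset $posicion, match length ", strlen($coincidencia), "\n";
}
```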


    Searching by code instead of by name

    To search by code instead of by name, replace \RNombre : Juan$ with the pattern you need. For example:

    • if the line must be exactly Codigo : xxxx or just xxxx:

      \R(?:Codigo : )?c001$
      
    • or if the code may appear at the end of any line, using \b to guarantee that it is a whole word:

      \R.*\bc001$
      
    • or anywhere on the line, even when it is part of another code such as abc00123:

      \R.*c001.*+
      

    Example:

    /^Header 1$(?:\R(?!Header ).*+)*?\R(?:Codigo : )?c001$(?:\R(?!Header ).*+)*/mi
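
The difference between the three code patterns is easier to see side by side. This sketch runs each one against a few made-up candidate lines; the array keys (`exact`, `word`, `anywhere`) are labels invented here, not part of the answer.

```php
<?php
// The three line patterns from the list above, in the same order.
$patrones = [
    'exact'    => '/\R(?:Codigo : )?c001$/m',
    'word'     => '/\R.*\bc001$/m',
    'anywhere' => '/\R.*c001.*+/m',
];
// Made-up candidate lines.
$lineas = ['Codigo : c001', 'c001', 'Ref c001', 'abc00123'];

$coincide = [];
foreach ($patrones as $nombre => $patron) {
    foreach ($lineas as $linea) {
        // Every pattern begins with \R, so prepend a line break to the candidate.
        $coincide[$nombre][$linea] = preg_match($patron, "\n" . $linea) === 1;
    }
}

foreach ($coincide as $nombre => $fila) {
    echo str_pad($nombre, 9), implode(' ', array_map('intval', $fila)), "\n";
}
```

Under these assumptions, the first pattern accepts only the first two lines, the second also accepts "Ref c001", and the third accepts all four, including abc00123.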
    


    Code:

    To find all the matches, we use preg_match_all() .

    // $texto is assumed to hold the input text shown in the question
    $regex = '/^Header 1$(?:\R(?!Header ).*+)*?\RNombre : Juan$(?:\R(?!Header ).*+)*/mi';

    if (preg_match_all($regex, $texto, $resultado)) {
        // show each section ($resultado[0] holds the full matches)
        $n = 0;
        foreach ($resultado[0] as $seccion) {
            echo "\n-----Seccion " . ++$n . "-----\n";
            echo $seccion;
        }
    } else {
        echo "No se encontró el nombre";
    }
    

    Result:

    -----Seccion 1-----
    Header 1
    Codigo : c001
    Nombre : Juan
    Total  : 45,78
    
    ... etc. (all 3 sections)
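
If the code to search for comes from a variable, one possible approach is to build the pattern at run time. The function name `seccionesPorCodigo` is hypothetical, the sample text is a condensed version of the one in the question, and `\n` line endings are assumed (with `\r\n` input, `$` would not match before the `\r`).

```php
<?php
// Hypothetical helper (not part of the original answer): returns every section
// whose code line is either "Codigo : <code>" or just "<code>".
function seccionesPorCodigo(string $texto, string $codigo): array
{
    // preg_quote() escapes regex metacharacters in the code, plus the "/" delimiter.
    $cod = preg_quote($codigo, '/');
    $regex = '/^Header 1$(?:\R(?!Header ).*+)*?\R(?:Codigo : )?' . $cod
           . '$(?:\R(?!Header ).*+)*/mi';

    preg_match_all($regex, $texto, $resultado);

    // Strip the trailing blank lines that sit between a section and the next header.
    return array_map('rtrim', $resultado[0]);
}

// Condensed version of the question's text, joined with "\n".
$texto = implode("\n", [
    'Header 1', 'Codigo : c001', 'Nombre : Juan', 'Total  : 45,78', '',
    'Header 1', 'c001', 'Nombre : Juan', 'Cantidad : 23', '',
    'Header 1', 'Codigo : c002', 'Nombre : Ana', 'Total  : 45.89',
    'Header 1', 'c001', 'Nombre : Juan', 'Cantidad:4',
]);

$secciones = seccionesPorCodigo($texto, 'c001');
foreach ($secciones as $n => $seccion) {
    echo "\n-----Seccion " . ($n + 1) . "-----\n", $seccion, "\n";
}
```

Escaping the code with preg_quote() matters as soon as a code can contain characters such as `.` or `+`, which would otherwise be interpreted as regex syntax.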
    


        
    answered by 22.03.2018 / 04:52