Perl: search for proper names in a text


I wrote this program to identify proper names in a text.

This is my text:

Vine con Pablo a la casa.
Pedro me lo dijo.
Fui con Mariano García a la cena.
Cristina Maña no come.
No me cuentes con el AGG.
Ay que ver con Ana García Villa.
Soraya Puerto de Santamaría no es Ok.

For now, I only want to extract the proper names that appear inside a sentence.

My code is this:

#!/usr/bin/perl
use warnings;
$texto = "Corpus.txt";
open(INFILE, "<", $texto) or die "Can't open < $texto: $!";
while (my $row = <INFILE>) 
{
    #chomp $row;
    push @array, $row;   
    #print "$row\n";    
}

foreach $linea (@array) {
    # A single-token NE (named entity) inside the sentence. Example: Vine con Pablo a la casa.
    $linea =~ m/\s([A-Z][a-z]+)\s/;
    $pablo = $1;
    print("$pablo\n");
    #print $l;
 }

What I do not understand is why printing $pablo gives this result:

Pablo
Pablo
Mariano
Mariano
Mariano
Ana
Puerto
Puerto

Why does it seem to evaluate the first line more than once, while line 6, which contains Ana's name, is printed only once?

I have only been learning to program for a few weeks, so the program is clearly doing something other than what I think it should. Can someone tell me where the "fundamental error" is?

Thank you very much.

    
asked by Jorge 26.04.2018 at 04:16

1 answer


On the first line of the text, the program finds "Pablo" and stores it in the $pablo variable.

On the second line, the regular expression does not match: "Pedro" is at the start of the line, so there is no whitespace (\s) in front of it. Because the match fails, $1 is not updated and still holds its previous value, "Pablo".

The same happens on the remaining lines. "Mariano" appears three times because the two lines after the first "Mariano" do not match the pattern either ("Maña" contains ñ, which is outside [a-z], and "AGG" has no lowercase letters).
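
To see the effect in isolation, here is a minimal sketch that reproduces the loop from the question on two lines of the sample text. The helper name stale_captures is hypothetical, made up for this illustration; the pattern is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the original loop by reading $1 after
# every match attempt, whether or not the match succeeded.
sub stale_captures {
    my @lines = @_;
    my @seen;
    foreach my $linea (@lines) {
        $linea =~ m/\s([A-Z][a-z]+)\s/;   # result of the match is ignored
        push @seen, $1;                   # may still hold an older capture
    }
    return @seen;
}

my @result = stale_captures(
    "Vine con Pablo a la casa.",   # matches: $1 becomes "Pablo"
    "Pedro me lo dijo.",           # no match: $1 is left untouched
);
print join(", ", @result), "\n";
```

The second line contributes "Pablo" again even though it contains no match, which is exactly the repetition seen in the question's output.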

One way to fix it is to check whether the regular expression matched before using $1:

if ($linea =~ m/\s([A-Z][a-z]+)\s/) {
    print "$1\n";
}
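
Putting it together, the following is only a sketch of how the whole program might look with that check in place. It assumes the sample lines are given inline rather than read from Corpus.txt, and extract_names is a hypothetical helper name. It uses a match in list context, which returns an empty list on failure, as an alternative way of writing the same test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the first capitalized word found
# mid-sentence on each line, skipping lines where nothing matches.
sub extract_names {
    my @names;
    foreach my $linea (@_) {
        # In list context a failed match returns an empty list,
        # so $name stays undefined unless the pattern matched.
        my ($name) = $linea =~ m/\s([A-Z][a-z]+)\s/;
        push @names, $name if defined $name;
    }
    return @names;
}

# Sample text from the question, inlined for the example.
my @texto = (
    "Vine con Pablo a la casa.",
    "Pedro me lo dijo.",
    "Fui con Mariano García a la cena.",
    "Cristina Maña no come.",
    "No me cuentes con el AGG.",
    "Ay que ver con Ana García Villa.",
    "Soraya Puerto de Santamaría no es Ok.",
);

print "$_\n" for extract_names(@texto);
```

With the sample text this prints Pablo, Mariano, Ana and Puerto, once each. The pattern itself is unchanged, so names at the start of a line and names containing accented letters are still skipped.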
    
answered 26.04.2018 at 12:21