Making "str_replace_all" stop after the first match in R

1

How can I make str_replace_all stop after the first substitution (match) instead of continuing to look for more matches in the dictionary?

This is the code I have:

library(stringr)

x <- c("VALLE PINO CORSO","LA PAZ","PAZ")

This is the dictionary of bad words (malpal) and good words (buenapal). I cannot change the order of its entries.

malpal.corpus <- c("PINO CORSO","PAZ","PINO CORZO") # pattern
buenapal.corpus <- c("VALLE PINO CORZO","LA PAZ","VALLE PINO CORZO") # replacement

malpal.corpus <- str_c("\\b",malpal.corpus,"\\b")

vect.corpus <- buenapal.corpus
names(vect.corpus) <- malpal.corpus


str_replace_all(x, vect.corpus)

[1] "VALLE VALLE VALLE PINO CORZO" "LA LA PAZ"                      "LA PAZ"

What I want is for str_replace_all to apply only the first match and leave the result at that:

[1] "VALLE PINO CORZO" "LA PAZ"                      "LA PAZ"

Failing that, I would at least like to reduce the repetition by one VALLE:

[1] "VALLE VALLE PINO CORZO" "LA LA PAZ"                      "LA PAZ"
    
asked by dogged 15.02.2018 at 23:23

2 answers

3

Based on your example, I think the problem, as @Mariano commented, calls for exact matching rather than regular expressions. One way to solve it is the following:

malpal.corpus <- c("PINO CORSO","PAZ","PINO CORZO") # pattern
buenapal.corpus <- c("VALLE PINO CORZO","LA PAZ","VALLE PINO CORZO") # replacement
casos <- c("VALLE PINO CORSO","LA PAZ","PAZ")

replace.items <- sapply(seq_along(casos), function(x) buenapal.corpus[match(casos[x],malpal.corpus)])
ifelse(is.na(replace.items), casos, replace.items)

The result:

[1] "VALLE PINO CORSO" "LA PAZ"           "LA PAZ"

As you can see, "PAZ" was replaced with "LA PAZ", but "VALLE PINO CORSO" was left unchanged because it is not in the dictionary. This is resolved simply by adding the new case:

malpal.corpus <- c("VALLE PINO CORSO","PINO CORSO","PAZ","PINO CORZO") # pattern
buenapal.corpus <- c("VALLE PINO CORZO","VALLE PINO CORZO","LA PAZ","VALLE PINO CORZO") # replacement

The logic is relatively simple: with sapply we apply match() to each string in casos to see whether it equals any of the erroneous words and, if so, we retrieve the corresponding correct word: buenapal.corpus[match(casos[x],malpal.corpus)]. At the end, replace.items is a vector of the same length as casos that holds the replacement words, or NA where there was no match, so all that remains is to make the replacement: ifelse(is.na(replace.items), casos, replace.items).
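Because match() is already vectorized, the same lookup can probably be written without sapply. The following is only a sketch of that idea, using the extended dictionary from above; the names idx and resultado are introduced here for illustration and are not part of the original code.

```r
# Extended dictionary: exact strings to look for and their replacements
malpal.corpus <- c("VALLE PINO CORSO", "PINO CORSO", "PAZ", "PINO CORZO")
buenapal.corpus <- c("VALLE PINO CORZO", "VALLE PINO CORZO", "LA PAZ", "VALLE PINO CORZO")
casos <- c("VALLE PINO CORSO", "LA PAZ", "PAZ")

# Position of each case in the dictionary, or NA when it is not listed
idx <- match(casos, malpal.corpus)

# Keep the original string when there is no exact match
resultado <- ifelse(is.na(idx), casos, buenapal.corpus[idx])
resultado
# [1] "VALLE PINO CORZO" "LA PAZ"           "LA PAZ"
```

Since each string is compared as a whole, a replacement can never be matched again by a later dictionary entry, which is what caused the repeated VALLE in the question.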

    
answered 16.02.2018 at 14:45
3

If you really need regular expressions, you can write them in a way that minimizes the conflicts you have, for example:

library(stringr)
malpal.corpus <- c("\\bVALLE PINO CORSO\\b|\\bPINO CORSO\\b","\\bLA PAZ\\b|\\bPAZ\\b") # pattern
buenapal.corpus <- c("VALLE PINO CORZO", "LA PAZ")
vect.corpus <- buenapal.corpus
names(vect.corpus) <- malpal.corpus

x <- c("VALLE PINO CORSO","LA PAZ","PAZ", "PINO CORSO")
str_replace_all(x, vect.corpus)

Output:

[1] "VALLE PINO CORZO" "LA PAZ"           "LA PAZ"           "VALLE PINO CORZO"

Take, for example, the pattern "\\bVALLE PINO CORSO\\b|\\bPINO CORSO\\b": the | (OR) operator accepts either of the two alternatives, and whichever one matches is replaced with "VALLE PINO CORZO". For the input "VALLE PINO CORSO", the first alternative, \bVALLE PINO CORSO\b, matches the whole string, so the replacement is applied once and causes no duplication.
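If the dictionary cannot be rewritten at all, another option might be to apply the patterns one by one and skip any string that an earlier pattern has already changed. The helper below is a hypothetical sketch of that idea, not a stringr feature; replace_first_hit, patrones and reemplazos are names made up for this example, and it assumes the original dictionary order from the question.

```r
library(stringr)

# Hypothetical helper: each string is rewritten only by the first
# dictionary pattern that matches it; later patterns are skipped.
replace_first_hit <- function(x, patterns, replacements) {
  out <- x
  done <- rep(FALSE, length(x))
  for (i in seq_along(patterns)) {
    # Strings not yet rewritten that match the current pattern (NA never hits)
    hit <- !done & !is.na(x) & str_detect(x, patterns[i])
    out[hit] <- str_replace_all(x[hit], patterns[i], replacements[i])
    done <- done | hit
  }
  out
}

patrones <- str_c("\\b", c("PINO CORSO", "PAZ", "PINO CORZO"), "\\b")
reemplazos <- c("VALLE PINO CORZO", "LA PAZ", "VALLE PINO CORZO")
x <- c("VALLE PINO CORSO", "LA PAZ", "PAZ")

replace_first_hit(x, patrones, reemplazos)
# [1] "VALLE VALLE PINO CORZO" "LA LA PAZ"              "LA PAZ"
```

This only reproduces the fallback result mentioned in the question (one VALLE fewer). Getting "VALLE PINO CORZO" for the first element still requires the dictionary to contain the full phrase, as in the examples above.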

    
answered 16.02.2018 at 15:33