Text string matches in Python 3

2

Hello, does anyone know a method to find non-exact (fuzzy) matches between text strings?

For example:

I have the text "STATUS MSG PACK ACM L" (column 1), and the search should return "PACK L" (column 2).

I have two lists: one contains longer texts written by a person, and the other contains the correct messages to look for.

I attach an example of the two lists: each entry in column 1 should be searched for in column 2, returning the most closely related column 2 element:

link

    
asked by Jorge Ponti 27.07.2017 at 15:57

1 answer

4

There are many tools and methods for fuzzy searches, but Python already ships with one out of the box: the standard library module difflib, which lets us obtain a similarity ratio between two strings. For example:

from difflib import SequenceMatcher as SM

s1 = 'Hola Mundo'
s2 = 'Hola Mundo cruel'
print(SM(None, s1, s2).ratio())

s1 = 'Hola Mundo'
s2 = 'Hola Mundo!'
print(SM(None, s1, s2).ratio())
> 0.7692307692307693
> 0.9523809523809523

In this example we measure the similarity of Hola Mundo to two other strings and see that, as expected, Hola Mundo! obtains a higher similarity ratio than Hola Mundo cruel. The idea, then, is to go through the first list and, for each element, compute the ratio against every element of the second list: the higher the ratio, the more similar the strings. Something like this:

import difflib

lista1 = ["STATUS MSG PACK ACM L"]
lista2 = ["LOW LIMIT VALVE L",
          "LOW LIMIT VALVE R",
          "PACK ACM L",
          "PACK ACM R",
          "PACK L",
          "PACK MODE L",]

for search in lista1:
    # Sort the candidates from most to least similar to the search string
    matches = sorted(lista2, key=lambda x: difflib.SequenceMatcher(None, x, search).ratio(), reverse=True)
    print("{0} is compared with {1}; the closest match is {2}".format(search, matches, matches[0]))

matches ends up holding the elements of the second list ordered from most to least similar, so the first element should be the best candidate.

Important: done this way, we will always find a "match", even when nothing in the second list is really similar. As an additional improvement, you should probably require a minimum similarity ratio before considering that a match has been found. That value can only be determined by experimenting with your data.
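
One way to apply such a minimum ratio is sketched below. The helper name best_match and the 0.6 default are assumptions made for illustration, not part of the original answer, and the threshold would need tuning against real data:

```python
from difflib import SequenceMatcher


def best_match(search, candidates, min_ratio=0.6):
    """Return (candidate, ratio) for the most similar candidate,
    or None when no candidate reaches min_ratio.

    best_match and min_ratio are hypothetical names for this sketch.
    """
    best = None
    best_ratio = 0.0
    for candidate in candidates:
        ratio = SequenceMatcher(None, candidate, search).ratio()
        # Keep the first candidate with the strictly highest ratio
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    if best is None or best_ratio < min_ratio:
        return None
    return best, best_ratio


lista2 = ["LOW LIMIT VALVE L", "LOW LIMIT VALVE R", "PACK ACM L",
          "PACK ACM R", "PACK L", "PACK MODE L"]

print(best_match("STATUS MSG PACK ACM L", lista2))
# A stricter threshold rejects weak matches instead of returning them
print(best_match("completely unrelated text", lista2, min_ratio=0.9))
```

Returning None rather than the weakest candidate lets the caller flag the entry for manual review.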

Better still is the form suggested by FjSevilla, which is more compact and already includes the logic for enforcing a minimum ratio:

matches = difflib.get_close_matches(search, possibilities=lista2, n=1, cutoff=0.6)
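
For completeness, here is a minimal self-contained sketch of that call applied to the lists from this answer. The variable closest and the output format are assumptions for illustration only:

```python
import difflib

lista1 = ["STATUS MSG PACK ACM L"]
lista2 = ["LOW LIMIT VALVE L",
          "LOW LIMIT VALVE R",
          "PACK ACM L",
          "PACK ACM R",
          "PACK L",
          "PACK MODE L"]

for search in lista1:
    # Returns at most n candidates whose ratio is >= cutoff, best first
    matches = difflib.get_close_matches(search, lista2, n=1, cutoff=0.6)
    # An empty list means nothing was similar enough
    closest = matches[0] if matches else None
    print("{0} -> {1}".format(search, closest))
```

Raising cutoff makes the search stricter; an empty result means no candidate reached it.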

As a curiosity, the matching in difflib is closely related to the Ratcliff/Obershelp "gestalt pattern matching" algorithm, published in the late 1980s as "Pattern Matching: The Gestalt Approach".
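
To see where the similarity value comes from: the Python documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes it by hand from the matching blocks (the variable names are illustrative):

```python
from difflib import SequenceMatcher

s1 = "Hola Mundo"
s2 = "Hola Mundo cruel"

matcher = SequenceMatcher(None, s1, s2)

# Each block is (start in s1, start in s2, size); the last block is a
# zero-length sentinel, so it does not change the sum.
blocks = matcher.get_matching_blocks()
matched = sum(block.size for block in blocks)

# Documented definition of ratio(): 2.0 * M / T
manual_ratio = 2.0 * matched / (len(s1) + len(s2))

print(blocks)
print(manual_ratio, matcher.ratio())
```

Because the total length is in the denominator, a short correct message compared against a much longer free-text entry gets a lower ratio even when it is fully contained in it.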

    
answered by 27.07.2017 at 17:24