For fuzzy searches there are multiple tools and methods, but with the Python we already have out of the box there is the standard-library module difflib,
which lets us obtain a similarity ratio
between strings. For example:
from difflib import SequenceMatcher as SM
s1 = 'Hola Mundo'
s2 = 'Hola Mundo cruel'
print(SM(None, s1, s2).ratio())
s1 = 'Hola Mundo'
s2 = 'Hola Mundo!'
print(SM(None, s1, s2).ratio())
> 0.7692307692307693
> 0.9523809523809523
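To see where those numbers come from: ratio() is defined as 2.0*M/T, where T is the total number of characters in both strings and M the number of characters that match. A minimal sketch, recomputing the first result by hand:

```python
from difflib import SequenceMatcher

s1 = 'Hola Mundo'        # 10 characters
s2 = 'Hola Mundo cruel'  # 16 characters, of which 10 match s1
ratio = SequenceMatcher(None, s1, s2).ratio()

# ratio() = 2*M/T = 2*10 / (10 + 16) = 20/26
print(ratio)  # ≈ 0.769
```

This is why the extra characters in "Hola Mundo cruel" drag the ratio down more than the single "!" does.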
In this example we measure the similarity of Hola Mundo
with other strings, and we see that, logically, Hola Mundo!
obtains a higher similarity ratio than Hola Mundo cruel
. The idea, then, is to go through one list and, for each element, compute the ratio against each element of the second list: the higher the ratio, the more similar. Something like this:
import difflib
lista1 = ["STATUS MSG PACK ACM L"]
lista2 = ["LOW LIMIT VALVE L",
"LOW LIMIT VALVE R",
"PACK ACM L",
"PACK ACM R",
"PACK L",
"PACK MODE L",]
for search in lista1:
    matches = sorted(lista2, key=lambda x: difflib.SequenceMatcher(None, x, search).ratio(), reverse=True)
    print("{0} is compared with {1}; the closest match is {2}".format(search, matches, matches[0]))
In matches
we end up with the elements of the second list, ordered from most similar to least; the first element should be the best candidate.
Important: done this way we will always get a "match". As an additional improvement, you should probably define a minimum similarity ratio
below which the match is not considered achieved; that threshold value can only be determined by experimenting.
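That improvement could be sketched like this; MIN_RATIO is a hypothetical threshold, not a value from the original answer, and you would tune it against your own data:

```python
import difflib

MIN_RATIO = 0.6  # hypothetical cutoff; adjust by experimenting

lista1 = ["STATUS MSG PACK ACM L"]
lista2 = ["LOW LIMIT VALVE L", "PACK ACM L", "PACK MODE L"]

for search in lista1:
    # keep only the single best candidate instead of sorting the whole list
    best = max(lista2, key=lambda x: difflib.SequenceMatcher(None, x, search).ratio())
    best_ratio = difflib.SequenceMatcher(None, best, search).ratio()
    if best_ratio >= MIN_RATIO:
        print("{0} -> {1} (ratio {2:.2f})".format(search, best, best_ratio))
    else:
        print("{0} -> no match above {1}".format(search, MIN_RATIO))
```

Here "PACK ACM L" is a full substring of the search term, so it clears the threshold comfortably.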
Better still is the form suggested by FjSevilla, both because it is more compact and because it already incorporates the minimum-ratio logic:
matches = difflib.get_close_matches(search, possibilities = lista2, n = 1, cutoff = 0.6)
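A quick check of that behavior, using a subset of the lists above: get_close_matches returns at most n candidates whose ratio reaches cutoff, ordered from most to least similar, and an empty list when nothing qualifies.

```python
import difflib

lista2 = ["LOW LIMIT VALVE L", "PACK ACM L", "PACK ACM R", "PACK L"]

# with the default-style cutoff of 0.6, "PACK ACM L" qualifies
hit = difflib.get_close_matches("STATUS MSG PACK ACM L", lista2, n=1, cutoff=0.6)
# with a strict cutoff of 0.9, no candidate reaches the ratio
miss = difflib.get_close_matches("STATUS MSG PACK ACM L", lista2, n=1, cutoff=0.9)

print(hit)   # ['PACK ACM L']
print(miss)  # [] -- the "matching" was not achieved
```

So the empty-list case is exactly where the manual version would have to compare against a hand-rolled threshold.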
As a curiosity, it is worth noting that difflib
is strongly based on the Ratcliff/Obershelp algorithm, described in the article "Pattern Matching: The Gestalt Approach" in the late 1980s.