I have a process that compares 3 fields of a table looking for duplicates.
My first approach was to create a second table with the same records, take each record from TABLE A, and compare it against every record in TABLE B. This technique takes far too long, since there are 100,000 records to compare.
I then tried to reduce the work by having the second query exclude rows whose first field is identical:
from fuzzywuzzy import fuzz

def xNumero(campoA, campo1, campoB, campo2, campoC, campo3):
    resultado1 = fuzz.token_sort_ratio(campoA, campo1)
    resultado2 = fuzz.token_sort_ratio(campoB, campo2)
    resultado3 = fuzz.token_sort_ratio(campoC, campo3)
    score = (resultado1 + resultado2 + resultado3) / 3
    return score
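For readers without fuzzywuzzy installed, here is a rough stdlib-only sketch of what `token_sort_ratio` computes (sort each string's tokens, then measure string similarity); the real library also normalizes punctuation, so treat this only as an illustration:

```python
from difflib import SequenceMatcher

def token_sort_ratio(a, b):
    # Approximation of fuzz.token_sort_ratio: lowercase, sort the
    # tokens of each string, then compare the rebuilt strings (0-100).
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, sa, sb).ratio() * 100)

# Word order no longer matters:
print(token_sort_ratio("Juan Perez", "Perez Juan"))  # 100
```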
sqlOrigen = "SELECT campoA, campoB, campoC FROM TABLA A"
cursor.execute(sqlOrigen)
for d in cursor:
    campoA = d[0]
    campoB = d[1]
    campoC = d[2]
    sqlComparacion = ("SELECT campo1, campo2, campo3 FROM TABLA B "
                      "WHERE campo1 <> %s")
    # Bind the value as a query parameter instead of string formatting,
    # which avoids SQL injection and quoting errors
    cursorConsulta.execute(sqlComparacion, (campoA,))
    for r in cursorConsulta:
        campo1 = r[0]  # the inner loop must read r, not d
        campo2 = r[1]
        campo3 = r[2]
Inside the inner loop, a function uses the fuzzywuzzy library to compare the fields and return a similarity percentage:
        result = xNumero(campoA, campo1, campoB, campo2, campoC, campo3)
The function returns the average of the three field comparisons as a percentage; depending on that value, the pair is either considered a duplicate or not, and duplicates are saved to a separate table.
This process is very slow. How could it be improved? I think a good approach would be to compare within a single table, but that is just an idea.
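A minimal sketch of that single-table idea: fetch all rows with one query, compare each unordered pair only once with `itertools.combinations`, and group rows by a cheap "blocking" key so only plausible candidates get the expensive fuzzy comparison. The blocking key here (initial of the alphabetically first token of the first field) and `difflib.SequenceMatcher` as a stand-in for `fuzz.token_sort_ratio` are assumptions to keep the example self-contained; both would need tuning on real data:

```python
from difflib import SequenceMatcher
from itertools import combinations
from collections import defaultdict

def ratio(a, b):
    # Stand-in for fuzz.token_sort_ratio (0-100)
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sa, sb).ratio() * 100

def buscar_duplicados(filas, umbral=90):
    """filas: list of (campoA, campoB, campoC) tuples fetched with ONE
    SELECT. Rows are grouped by a cheap blocking key, then each pair
    inside a group is fuzzy-compared exactly once."""
    bloques = defaultdict(list)
    for fila in filas:
        # Hypothetical blocking key: initial of the alphabetically
        # first token, so "Juan Perez" and "Perez Juan" share a block
        clave = min(fila[0].lower().split())[:1]
        bloques[clave].append(fila)
    duplicados = []
    for grupo in bloques.values():
        for f1, f2 in combinations(grupo, 2):  # each pair once, not twice
            score = sum(ratio(x, y) for x, y in zip(f1, f2)) / 3
            if score >= umbral:
                duplicados.append((f1, f2, score))
    return duplicados
```

Compared with the two-table version, this issues one query instead of 100,000, halves the pair count (A vs B and B vs A are the same pair), and the blocking step cuts the N² comparisons down to pairs that could plausibly match.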