Find anagram in a text file in python

0

I am stuck in a task, in the task I need to create a function that in that function searches in a text file for a quantity x of letters anagrams, after having found each pair it will be saved as tuple and the function will return the tuple

  

example:

     

> find_anagrams(1000) [('abel', 'able'), ('aboard', 'abroad'), ('abode', 'adobe'), ('accrues', 'accuser')]

this is my code:

n=int(input("Enter a number:"))

def find_anagrams(n):
    t=0
    file = open("words.txt")
    for i in range(n):
        for j in range(n):

            if (file.readline(i).lower().sort()==file.readline(i+1).lower().sort()):
                t[i]=(file.readline(i).lower().sort(),file.readline(i+1).lower().sort())

    return t

print(find_anagrams(n))
    
asked by Julio Mizrahi 03.08.2017 в 00:48
source

1 answer

0

The first thing is to read the word file. The most comfortable thing is to read it in memory. Working directly on the file, as you pretend, is not very advisable (besides that file.readline(i) does not work the way you are using it The i parameter does not indicate the line to read, but the number of bytes to read at most).

We create an iterator lines with all the lines of the file and remove the line break from the end ( x[:-1] ). We limit this iterator to n first lines with function itertools.islice and get a list. All within a block with to ensure that the file is closed once the list is obtained:

from itertools import islice

with open("words.txt") as file:
    lines = (x[:-1].lower() for x in file.readlines())
    words = list(islice(lines, n))

Now that we have the word list, we have to think about how we make the comparisons. We can try to compare words, all against all, but it has a very high cost, O (n ^ 2) , so it would take a lot if the lists are very long. You have to try some kind of strategy that improves the results.

Something habitual is to create a dictionary that classifies us the candidate words to be anagram, so that we can limit the number of comparisons. For example, we could organize words by length and only compare words with equal length. We could also create the dictionary using the word ordered as a key (something similar to what your code was starting to do). We can even do as suggested by @ César and use the sum of the codes of all the letters as a key.

For simplicity, I am going to use the word with the ordered letters as a dictionary key:

def key(word):
    return "".join(sorted(word))

from collections import defaultdict

d = defaultdict(list)
for w in words:
    d[key(w)].append(w)

A normal dictionary could have been used instead of defaultdict , but this simplifies the code and it becomes habitual to see its use in this way.

At this moment, we have a dictionary with all the anagrams grouped by the same key. To extract the tuples, just look at the items in the dictionary that has two elements:

anagrams = [tuple(v) for v in d.values() if len(v) == 2]

The downside is that there could be more than two anagrams for the same key. For example:

"aab": ["aab","aba","baa"]

Although it is not explicit in the statement, we can think that, in these cases, all possible combinations should be drawn, that is, something like this:

("aab", "aba")
("aab", "baa")
("aba", "baa")

For this task we are doing very well itertools.combinations :

from itertools import combinations

anagrams = [tuple(r) for v in d.values() for r in combinations(v, 2)]

And with this we are finished. The complete code would be:

from collections import defaultdict
from itertools import combinations


def key(w):
  return "".join(sorted(w))

d = defaultdict(list)
for w in words:
    d[key(w)].append(w)

anagrams = [tuple(r) for v in d.values() for r in combinations(v, 2)]

Edited : added the key function. A hashable object is required as a dictionary index.

    
answered by 03.08.2017 / 17:36
source