The first thing is to read the word file. The most comfortable thing is to read it in memory. Working directly on the file, as you pretend, is not very advisable (besides that file.readline(i)
does not work the way you are using it The i
parameter does not indicate the line to read, but the number of bytes to read at most).
We create an iterator lines
with all the lines of the file and remove the line break from the end ( x[:-1]
). We limit this iterator to n
first lines with function itertools.islice
and get a list. All within a block with
to ensure that the file is closed once the list is obtained:
from itertools import islice
with open("words.txt") as file:
lines = (x[:-1].lower() for x in file.readlines())
words = list(islice(lines, n))
Now that we have the word list, we have to think about how we make the comparisons. We can try to compare words, all against all, but it has a very high cost, O (n ^ 2) , so it would take a lot if the lists are very long. You have to try some kind of strategy that improves the results.
Something habitual is to create a dictionary that classifies us the candidate words to be anagram, so that we can limit the number of comparisons. For example, we could organize words by length and only compare words with equal length. We could also create the dictionary using the word ordered as a key (something similar to what your code was starting to do). We can even do as suggested by @ César and use the sum of the codes of all the letters as a key.
For simplicity, I am going to use the word with the ordered letters as a dictionary key:
def key(word):
return "".join(sorted(word))
from collections import defaultdict
d = defaultdict(list)
for w in words:
d[key(w)].append(w)
A normal dictionary could have been used instead of defaultdict
, but this simplifies the code and it becomes habitual to see its use in this way.
At this moment, we have a dictionary with all the anagrams grouped by the same key. To extract the tuples, just look at the items in the dictionary that has two elements:
anagrams = [tuple(v) for v in d.values() if len(v) == 2]
The downside is that there could be more than two anagrams for the same key. For example:
"aab": ["aab","aba","baa"]
Although it is not explicit in the statement, we can think that, in these cases, all possible combinations should be drawn, that is, something like this:
("aab", "aba")
("aab", "baa")
("aba", "baa")
For this task we are doing very well itertools.combinations
:
from itertools import combinations
anagrams = [tuple(r) for v in d.values() for r in combinations(v, 2)]
And with this we are finished. The complete code would be:
from collections import defaultdict
from itertools import combinations
def key(w):
return "".join(sorted(w))
d = defaultdict(list)
for w in words:
d[key(w)].append(w)
anagrams = [tuple(r) for v in d.values() for r in combinations(v, 2)]
Edited : added the key
function. A hashable object is required as a dictionary index.