There are some gaps in your question, such as the concrete structure of the dataframe, if you have to ignore the spaces that appear around some of the words, if you have to ignore uppercase / lowercase, if the couple ('Health', 'Supply ') is the same as (' Supply ',' Health '), etc.
I have made some reasonable assumptions: the spaces at the beginning and end of the chains are ignored, the case difference is not significant, and the order of the words in the tuple is irrelevant.
With these hypotheses, the code that would process the list that you set as an example would be the following. I use standard python libraries to facilitate the task, such as itertools.combinations()
to generate the tuples of each row, or collections.defaultdict()
to have a dictionary that is automatically initialized with 0 the first time an element is inserted.
import itertools
import collections
lines_res = [['Energy', ' Environmental', ' Supply', ' Health '],
['Energy', ' Health', 'OR in health services', ' Supply '],
[' Inventory', ' Supply']]
pares = collections.defaultdict(int)
for fila in lines_res:
# Eliminar espacios al principio y final y pasar a minúsculas
fila = (palabra.strip().lower() for palabra in fila)
for caso in itertools.combinations(fila, 2):
# Reordenar alfabeticamente la tupla para el orden no importe
caso = tuple(sorted(caso))
# Incrementar ese caso (gracias a defaultdict empieza siendo 0)
pares[caso]+=1
The result is:
{('energy', 'environmental'): 1,
('energy', 'health'): 2,
('energy', 'or in health services'): 1,
('energy', 'supply'): 2,
('environmental', 'health'): 1,
('environmental', 'supply'): 1,
('health', 'or in health services'): 1,
('health', 'supply'): 2,
('inventory', 'supply'): 1,
('or in health services', 'supply'): 1}
What I suppose is what you were looking for (although it is not what comes out executing the code that you put in, because you did not ignore the spaces).
Pandas dataframes
Now, if the data instead of being in a list is in a pandas dataframe, I'm going to assume that in that dataframe each row corresponds to one of the elements of your list. This creates some problems, since in your example not all rows are of the same length, so there will be None
in some of the cells.
In short, I suppose this is the dataframe:
>>> df = pd.DataFrame(lines_res)
In that case you can iterate through each row using df.iterrows()
. This function returns one tuple per row. The first element is the index of the row (which we can ignore) and the second is a pandas.Series
with the contents of the row, by which you can iterate normally (but beware, you have to skip the None
).
The following code would process this dataframe (it is analogous to the one I used before for the list, but with the mentioned precautions):
pares = collections.defaultdict(int)
for _, fila in df.iterrows():
fila = (palabra.strip().lower() for palabra in fila if palabra is not None)
for caso in itertools.combinations(fila, 2):
caso = tuple(sorted(caso))
pares[caso]+=1
The code uses generators to generate the elements to be processed. For example, the assignment to fila
does not produce a list (that would consume memory), but a lazy generator that only produces values as an iterator asks for them. itertools.combinations()
also returns a generator instead of a list. As a result, memory is not being wasted by storing intermediate lists, and this code could process large volumes of information.