Create a dictionary from text from a dataframe

Question

Create a dictionary from text from a dataframe

Navigation

#1 by (0 votes)

1

I have a DataFrame that is created from a txt file. I need to go through each row, find pairs of repeated words and calculate the frequency of appearance.

I thought to create a dictionary that stores the pairs of words and with a counter go calculating the frequencies, but I do not know how to do the for to go through the matrix in the DataFrame. I have evaluated this option but it tells me that len can not be used in DataFrame and that the indexes of the dictionary and DataFrame can not be defined. Below I attach a short example of the data (lines of words) and the idea of code that I have:

lines_res: ['Energy', ' Environmental', ' Health ', ' Supply']
           ['Energy', ' Health', 'OR in health services', ' Supply ']
           [' Inventory', ' Supply']

pares={}
for i in lines_res:
    for j in range(len(i)):
        for k in range(j+1,len(i)):
            if (i[j],i[k]) in pares.keys():
                pares[i[j],i[k]]+=1
            elif (i[k],i[j]) in pares.keys():
                pares[(i[k],i[j])]+=1
            else:
                pares[(i[j],i[k])]=1

python python-3.x pandas

asked by Ulises 28.02.2018 в 22:46

source

1 answer

Auth laravel 5.5 with a different table Error connecting Bluetooth printer Andoid Studio

score 0 · Accepted Answer

There are some gaps in your question, such as the concrete structure of the dataframe, if you have to ignore the spaces that appear around some of the words, if you have to ignore uppercase / lowercase, if the couple ('Health', 'Supply ') is the same as (' Supply ',' Health '), etc.

I have made some reasonable assumptions: the spaces at the beginning and end of the chains are ignored, the case difference is not significant, and the order of the words in the tuple is irrelevant.

With these hypotheses, the code that would process the list that you set as an example would be the following. I use standard python libraries to facilitate the task, such as itertools.combinations() to generate the tuples of each row, or collections.defaultdict() to have a dictionary that is automatically initialized with 0 the first time an element is inserted.

import itertools
import collections
lines_res = [['Energy', ' Environmental', ' Supply', ' Health '],
           ['Energy', ' Health', 'OR in health services', ' Supply '],
           [' Inventory', ' Supply']]
pares = collections.defaultdict(int)
for fila in lines_res:
    # Eliminar espacios al principio y final y pasar a minúsculas
    fila = (palabra.strip().lower() for palabra in fila)
    for caso in itertools.combinations(fila, 2):
        # Reordenar alfabeticamente la tupla para el orden no importe
        caso = tuple(sorted(caso))
        # Incrementar ese caso (gracias a defaultdict empieza siendo 0)
        pares[caso]+=1

The result is:

{('energy', 'environmental'): 1,
 ('energy', 'health'): 2,
 ('energy', 'or in health services'): 1,
 ('energy', 'supply'): 2,
 ('environmental', 'health'): 1,
 ('environmental', 'supply'): 1,
 ('health', 'or in health services'): 1,
 ('health', 'supply'): 2,
 ('inventory', 'supply'): 1,
 ('or in health services', 'supply'): 1}

What I suppose is what you were looking for (although it is not what comes out executing the code that you put in, because you did not ignore the spaces).

Pandas dataframes

Now, if the data instead of being in a list is in a pandas dataframe, I'm going to assume that in that dataframe each row corresponds to one of the elements of your list. This creates some problems, since in your example not all rows are of the same length, so there will be None in some of the cells.

In short, I suppose this is the dataframe:

>>> df = pd.DataFrame(lines_res)

In that case you can iterate through each row using df.iterrows() . This function returns one tuple per row. The first element is the index of the row (which we can ignore) and the second is a pandas.Series with the contents of the row, by which you can iterate normally (but beware, you have to skip the None ).

The following code would process this dataframe (it is analogous to the one I used before for the list, but with the mentioned precautions):

pares = collections.defaultdict(int)
for _, fila in df.iterrows():
    fila = (palabra.strip().lower() for palabra in fila if palabra is not None)
    for caso in itertools.combinations(fila, 2):
        caso = tuple(sorted(caso))
        pares[caso]+=1

The code uses generators to generate the elements to be processed. For example, the assignment to fila does not produce a list (that would consume memory), but a lazy generator that only produces values as an iterator asks for them. itertools.combinations() also returns a generator instead of a list. As a result, memory is not being wasted by storing intermediate lists, and this code could process large volumes of information.