Python Groupby pandas sums of strings and floating

1

I am trying to unify certain columns of my csv with pandas. The problem is that I want to add the values of Imp and Rev but the Categories I want them to come together and look something like this:

sitio_id;url;Imp;Rev;Categorias
http://url.com/;19472;6,31;Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer

csv:

sitio_id;url;Imp; Rev ; eCPM ;Categorias
104521;http://url.com/;18984;15.8;0.83;Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer
70209;http://url.com/;488;0.51;1.04;Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer

Try using a group by:

df = df.groupby(['sitio_id', 'url'])['Imp','Rev'].sum()['Categorias'].apply(lambda x: "{%s}" % ', '.join(x))
    
asked by Martin Bouhier 16.05.2018 в 20:37
source

2 answers

1

For completing the answer of FJSevilla, you may be interested that the result in "Categories" is not a mere concatenation of the group's chains, because repeated categories may appear.

You can pass your own functions to the .agg() method. The function you write will receive as a parameter a series whose elements are the cells of the group, and your function must return a result that groups them in some way into a single value.

The following function divides the contents of each cell (which is a string) by the commas, accumulates the words in a set (which ensures that the repeated ones appear only once) and finally concatenates the elements of the set by alphabetically arranging them and separating them by commas:

def juntar(datos):
  categorias = set()
  for d in datos:
    categorias.update(d.split(","))
  return ", ".join(sorted(categorias))

To use this function together with agg() , it's as simple as:

df.groupby(['sitio_id', 'url']).agg(
     { "Imp": "sum",
       "Rev": "sum", 
       "Categorias": juntar})

As an example, based on these data:

sitio_id;url;Imp;Rev;eCPM;Categorias
104521;http://url.com/;18984;15.8;0.83;Sports, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer
104521;http://url.com/;488;0.51;1.04;Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, World Soccer

The result would be:

                            Imp    Rev                                                                                          Categorias
sitio_id url                                                                                                                              
104521   http://url.com/  19472  16.31   Auto Racing,  Bodybuilding,  Boxing,  Pro Basketball,  Tennis,  Volleyball,  World Soccer, Sports

Where you can see that the categories come out neat and without repetitions.

    
answered by 16.05.2018 / 22:38
source
1

The problem is that you are trying to use the column Categorias on the output of df.groupby(['sitio_id', 'url'])['Imp','Rev'].sum() , which column column will not have.

To apply different functions to columns when you use groupby you can use agg and pass a dictionary with couples columna: función . In your case it would be something like this:

import io
import pandas as pd


csv = io.StringIO(
'''sitio_id;url;Imp;Rev;eCPM;Categorias
104521;http://url.com/;18984;15.8;0.83;Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer
104521;http://url.com/;488;0.51;1.04;Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer
''')


df = pd.read_csv(csv, sep=";", skipinitialspace=True)
df2 = df.groupby(['sitio_id', 'url']).agg({'Imp': 'sum',
                                           'Rev': 'sum',
                                           "Categorias": lambda x: ', '.join(x)})

The content of df2 is:

                            Imp    Rev                                                                                                                                                                                  Categorias
sitio_id url                             
104521   http://url.com/  19472  16.31   'Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer, Sports, Auto Racing, Bodybuilding, Boxing, Pro Basketball, Tennis, Volleyball, World Soccer'  
    
answered by 16.05.2018 в 22:24