Count number of repetitions in pairs using pandas

0

A question recently , that did not specify with what language / technology I wanted to solve it, it impelled me to think how I would do it with Pandas.

The question can be rephrased in the Pandas context as follows.

Given a dataframe such as the one I build here:

import pandas as pd
datos = [["jose", "manzana"],
         ["andres","pera"],
         ["luis", "pera"],
         ["jose", "manzana"]]
df = pd.DataFrame(datos, columns=["Nombre", "Fruta"])
print(df)

What looks like this:

   Nombre    Fruta
0    jose  manzana
1  andres     pera
2    luis     pera
3    jose  manzana

I want to get another one in which the index is the names (without repetitions), the columns the fruits, and the cells the number of times each name appears with each fruit. That is:

Fruta   manzana  pera
Nombre               
andres        0     1
jose          2     0
luis          0     1

My attempt

It seems that this is asking for pivot_table() , but in order to solve it I have not come up with another way to add an extra column filled with 1 to the original dataframe:

>>> print(df.assign(Cuantas=1))
   Nombre    Fruta  Cuantas
0    jose  manzana        1
1  andres     pera        1
2    luis     pera        1
3    jose  manzana        1

To be able to use that column as a value to add in pivot_table . My solution therefore remains as follows:

sol = df.assign(Cuantas=1).pivot_table(
        index="Nombre", 
        columns="Fruta", 
        values="Cuantas", 
        aggfunc="sum", 
        fill_value=0)

and this effectively gives the desired result, but I find it a bit cumbersome.

The question is Is there another easier way? .

    
asked by abulafia 11.05.2018 в 15:17
source

1 answer

1

What you are looking for seems to be what is called contingency table , pandas has that functionality in crosstab() , you can do something like this:

import pandas as pd

datos = [["jose", "manzana"],
         ["andres","pera"],
         ["luis", "pera"],
         ["jose", "manzana"]]

df = pd.DataFrame(datos, columns=["Nombre", "Fruta"])

print(pd.crosstab(df.Nombre, df.Fruta))

Exit:

Fruta   manzana  pera
Nombre               
andres        0     1
jose          2     0
luis          0     1
    
answered by 11.05.2018 / 16:30
source