How to categorize a DataFrame with Python

I am using Python and pandas to sort a DataFrame that I have categorized using a column of Boolean data, like the following:

df:

X    Y    PROB
2    4    False
3    5    False
3    2    False
4    4    True
3    7    True
2    4    False
2    3    False

I want to obtain two new DataFrames holding the 'X' and 'Y' data, one for the False rows and one for the True rows, with each run of consecutive rows numbered. For the False rows:

X   Y  PROB
2   4   1
3   5   1
3   2   1
2   4   2  
2   3   2

And for the True rows:

X   Y  PROB
4   4   1
3   7   1

So far I'm using factorize, but I can't get the syntax right to produce this output. Any ideas?

    
asked by Jonathan Pacheco 31.08.2017 at 21:32

1 answer

First, let's create a reproducible example based on the one you provided:

import pandas as pd


data = {'X': [2, 3, 3, 4, 3, 2, 2], 
        'Y': [4, 5, 2, 4, 7, 4, 3], 
        'PROB': [False, False, False, True, True, False, False]
        }
df = pd.DataFrame(data, columns = ['X', 'Y', 'PROB'])

A simple way to handle cases like this is to use pandas.Series.shift to compare each element with the previous one and check whether they differ. Combined with pandas.Series.cumsum, this gives the numbered categories. For example:

>>> df['Categorias'] = (df.PROB != df.PROB.shift()).cumsum()
>>> df

   X  Y   PROB  Categorias
0  2  4  False           1
1  3  5  False           1
2  3  2  False           1
3  4  4   True           2
4  3  7   True           2
5  2  4  False           3
6  2  3  False           3
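To make it clearer why this one-liner works, here is a minimal sketch that splits it into its intermediate steps. The variable names (prob, previous, run_starts, run_ids) are only illustrative and do not appear in the code above.

```python
import pandas as pd

# Same Boolean values as the PROB column in the example
prob = pd.Series([False, False, False, True, True, False, False])

# shift() moves every value down one row; the first row becomes NaN
previous = prob.shift()

# True wherever a value differs from the one before it, i.e. at the
# start of each run (the first row is True because False != NaN)
run_starts = prob != previous

# The running total of the run starts gives each run its own number
run_ids = run_starts.cumsum()

print(run_starts.tolist())  # [True, False, False, True, False, True, False]
print(run_ids.tolist())     # [1, 1, 1, 2, 2, 3, 3]
```

Each True in run_starts adds 1 to the running total, so all rows in the same run share one number.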

In your case, you want the category numbering to be independent within each sub-DataFrame obtained by splitting on the PROB column. To do this, we can apply the same operation again to each resulting DataFrame. To select the "True" and "False" rows, just use the PROB column as a Boolean mask:

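# Number each run of consecutive equal values in PROB
aux = (df.PROB != df.PROB.shift()).cumsum()

# Rows where PROB is False, with their runs renumbered from 1
falsos = df[~df.PROB].copy()
falsos['PROB'] = (aux[~df.PROB] != aux[~df.PROB].shift()).cumsum()

# Rows where PROB is True, with their runs renumbered from 1
verdaderos = df[df.PROB].copy()
verdaderos['PROB'] = (aux[df.PROB] != aux[df.PROB].shift()).cumsum()

del aux  # the helper series is no longer needed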

Output:

>>> df

   X  Y   PROB
0  2  4  False
1  3  5  False
2  3  2  False
3  4  4   True
4  3  7   True
5  2  4  False
6  2  3  False

>>> verdaderos

   X  Y  PROB
3  4  4     1
4  3  7     1

>>> falsos

   X  Y  PROB
0  2  4     1
1  3  5     1
2  3  2     1
5  2  4     2
6  2  3     2
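If you need this for more than one column, the same idea could be wrapped in a function. The following is only a sketch: split_runs is a hypothetical helper name, not something pandas provides. Since you mentioned factorize, it uses pandas.factorize for the renumbering step instead of a second shift/cumsum. factorize assigns codes in order of first appearance, which here gives the same result.

```python
import pandas as pd


def split_runs(df, column):
    """Split df on a Boolean column and number the consecutive runs.

    Hypothetical helper (not part of pandas). Returns a dict with the
    keys False and True, each holding a copy of the matching rows in
    which `column` is replaced by the run number, starting at 1.
    """
    mask = df[column].astype(bool)
    # Global run number: increases every time the value changes
    run_id = (mask != mask.shift()).cumsum()

    parts = {}
    for value in (False, True):
        rows = mask == value
        part = df[rows].copy()
        # factorize numbers the distinct run ids 0, 1, 2, ... in order
        # of appearance; add 1 so the numbering starts at 1
        part[column] = pd.factorize(run_id[rows])[0] + 1
        parts[value] = part
    return parts


data = {'X': [2, 3, 3, 4, 3, 2, 2],
        'Y': [4, 5, 2, 4, 7, 4, 3],
        'PROB': [False, False, False, True, True, False, False]}
df = pd.DataFrame(data, columns=['X', 'Y', 'PROB'])

parts = split_runs(df, 'PROB')
print(parts[False])  # same rows and numbering as falsos above
print(parts[True])   # same rows and numbering as verdaderos above
```

Because the function works on copies, the original df keeps its Boolean PROB column.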
    
answered by 01.09.2017 at 01:57