Problem with pandas.DataFrame.cumsum function

Question

Problem with pandas.DataFrame.cumsum function

Navigation

#1 by (0 votes)

1

I have the following dataframe in python:

month = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4]
active = [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
data1 = [1709.1,3869.7,4230.4,4656.9,48566.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93738.2,189293.2,194412.6,206585.8]
df = pd.DataFrame({
                    'month' : month,
                    'active' : active,
                    'd1' : data1,
                    'calculate' : 0,
                });

and I need to calculate the column 'calculate', in the following way:

    month  active        d1  calculate
0       1       1    1709.1     569.70
1       2       1    3869.7    1859.60
2       3       1    4230.4    3269.73
3       4       1    4656.9    4822.03
4       5       0   48566.0       0.00
5       6       0       0.0       0.00
6       7       0       0.0       0.00
7       8       0       0.0       0.00
8       9       0       0.0       0.00
9      10       0       0.0       0.00
10     11       0       0.0       0.00
11     12       0       0.0       0.00
12     13       0       0.0       0.00
13     14       0       0.0       0.00
14     15       0       0.0       0.00
15     16       0       0.0       0.00
16     17       0       0.0       0.00
17     18       0       0.0       0.00
18     19       0       0.0       0.00
19     20       0       0.0       0.00
20      1       1   93738.2   31246.07
21      2       1  189293.2   94343.80
22      3       1  194412.6  159148.00
23      4       1  206585.8  228009.93

I'm doing it in the following way:

df['calculate'] = np.where(
                        df.month > 1,
                        np.where(
                                    df.active,
                                    (df.d1/3).cumsum(),
                                    0,
                        ),
                        (df['d1']/3)
                    )

but the result is not what was expected:

    month  active        d1      calculate
0       1       1    1709.1     569.700000
1       2       1    3869.7    1859.600000
2       3       1    4230.4    3269.733333
3       4       1    4656.9    4822.033333
4       5       0   48566.0       0.000000
5       6       0       0.0       0.000000
6       7       0       0.0       0.000000
7       8       0       0.0       0.000000
8       9       0       0.0       0.000000
9      10       0       0.0       0.000000
10     11       0       0.0       0.000000
11     12       0       0.0       0.000000
12     13       0       0.0       0.000000
13     14       0       0.0       0.000000
14     15       0       0.0       0.000000
15     16       0       0.0       0.000000
16     17       0       0.0       0.000000
17     18       0       0.0       0.000000
18     19       0       0.0       0.000000
19     20       0       0.0       0.000000
20      1       1   93738.2   31246.066667
21      2       1  189293.2  115354.500000
22      3       1  194412.6  180158.700000
23      4       1  206585.8  249020.633333

I do not know if I'm clear on my application, I thank whoever can help me.

python python-3.x numpy pandas

asked by Yeison H. Arias 27.09.2018 в 15:14

source

1 answer

Android sqlite INNER JOIN that returns an array How to deny url access if you have not logged in

score 0 · Accepted Answer

If I'm not wrong, you intend to calculate the sum accumulated by groups, grouping first by month and then within each previous group group again according to the activity periods (contiguous rows with value 1 in active ).

One possibility is to use groupby to get the groups and then apply pandas.Series.groupby.cumsum . For detect each group we can help with shift and cumsum .

import pandas as pd


month = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4]
active = [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
data1 = [1709.1,3869.7,4230.4,4656.9,48566.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93738.2,189293.2,194412.6,206585.8]
df = pd.DataFrame({
                    'month' : month,
                    'active' : active,
                    'd1' : data1,
                    'calculate' : 0,
                })




month_parts = (df["month"] < df["month"].shift()).cumsum()
mask = df["active"] == 1
active_parts = (mask != mask.shift(1)).cumsum()

aux = df.loc[mask, "d1"] / 3

df['calculate'] = aux.groupby([month_parts, active_parts]).cumsum()
df['calculate'].fillna(0, inplace=True)

>>> df
    month  active        d1      calculate
0       1       1    1709.1     569.700000
1       2       1    3869.7    1859.600000
2       3       1    4230.4    3269.733333
3       4       1    4656.9    4822.033333
4       5       0   48566.0       0.000000
5       6       0       0.0       0.000000
6       7       0       0.0       0.000000
7       8       0       0.0       0.000000
8       9       0       0.0       0.000000
9      10       0       0.0       0.000000
10     11       0       0.0       0.000000
11     12       0       0.0       0.000000
12     13       0       0.0       0.000000
13     14       0       0.0       0.000000
14     15       0       0.0       0.000000
15     16       0       0.0       0.000000
16     17       0       0.0       0.000000
17     18       0       0.0       0.000000
18     19       0       0.0       0.000000
19     20       0       0.0       0.000000
20      1       1   93738.2   31246.066667
21      2       1  189293.2   94343.800000
22      3       1  194412.6  159148.000000
23      4       1  206585.8  228009.933333

The example is not very good if you really want this since there is a coincidence between the activity period and the start of a new rank in month , an example that best shows how the code works:

month = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4]
active = [1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
data1 = [1709.1,3869.7,4230.4,4656.9,48566.0,0,4000.0,5231.4,0,0,0,0,0,0,0,0,45215.2,2154.1,451.2,14523.21,93738.2,189293.2,194412.6,206585.8]
df = pd.DataFrame({
                    'month' : month,
                    'active' : active,
                    'd1' : data1,
                    'calculate' : 0,
                })


month_parts = (df["month"] < df["month"].shift()).cumsum()
mask = df["active"] == 1
active_parts = (mask != mask.shift(1)).cumsum()

aux = df.loc[mask, "d1"] / 3

df['calculate'] = aux.groupby([month_parts, active_parts]).cumsum()
df['calculate'].fillna(0, inplace=True)

>>> df
    month  active         d1      calculate
0       1       1    1709.10     569.700000  <-----
1       2       1    3869.70    1859.600000
2       3       1    4230.40    3269.733333
3       4       1    4656.90    4822.033333
4       5       0   48566.00       0.000000
5       6       0       0.00       0.000000
6       7       1    4000.00    1333.333333  <-----
7       8       1    5231.40    3077.133333
8       9       0       0.00       0.000000
9      10       0       0.00       0.000000
10     11       0       0.00       0.000000
11     12       0       0.00       0.000000
12     13       0       0.00       0.000000
13     14       0       0.00       0.000000
14     15       0       0.00       0.000000
15     16       0       0.00       0.000000
16     17       1   45215.20   15071.733333  <-----
17     18       1    2154.10   15789.766667
18     19       1     451.20   15940.166667
19     20       1   14523.21   20781.236667
20      1       1   93738.20   31246.066667  <-----
21      2       1  189293.20   94343.800000
22      3       1  194412.60  159148.000000
23      4       1  206585.80  228009.933333