List times a column is repeated

1

Good morning!

I have to obtain the number of times that a complete line appears repeated in my dataframe, to then show only those lines that appear repeated and show in the last column how many times those lines appear repetitions.

This should be the result of the correct table:

          dur       wage1   wage2   wage3   cola    hours  pension  stby_pay    shift_diff  num_reps
     6    3.0        2.0    3.0     NaN      tcf      NaN   empl_contr  NaN      NaN           4
     8    1.0        2.8    NaN     NaN      none    38.0   empl_contr  2.0      3.0           2
     9    1.0        5.7    NaN     NaN      none    40.0   empl_contr  NaN      4.0           3
     43   2.0        2.5    3.0     NaN      NaN     40.0    none       NaN      NaN           2

This is my current code:

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])

    x = data[data.duplicated(keep=False)].drop_duplicates()

    return x

I get the result correctly, however the last column "nums_reps" that should show me how many times that repeated column appears, I have something lost, since I do not know how to do that accounting.

This is the result that I am currently obtaining, it is very similar to the one I should have, however I do not know how to add the accounting of the repeated rows at the end.

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN

In summary, how could you perform that calculation? How could I count the number of rows that have all the attributes equal, to finally show them in a final row "num_reps".

Thanks

    
asked by fiticida 16.11.2017 в 20:35
source

1 answer

1

If you do not mind losing the original index of each row I think you can do it using only DataFrame.groupby to eliminate duplicate rows as well as to count them:

import pandas as pd

>>> data = pd.DataFrame({"Col1":("a", "a", "b", "a", "b", "c", "b", "b", "b"), 
                         "Col2":("x", "y", "z", "x", "z", "j", "z", "i", "i")})                            
>>> data

  Col1 Col2
0    a    x
1    a    y
2    b    z
3    a    x
4    b    z
5    c    j
6    b    z
7    b    i
8    b    i


>>> s = data.groupby(data.columns.tolist(), as_index=False).size()
>>> out = s[s > 1].reset_index()
>>> out.rename(columns={0: "num_reps"}, inplace=True)
>>> out

  Col1 Col2  num_reps
0    a    x         2
1    b    i         2
2    b    z         3
    
answered by 16.11.2017 в 22:43