Good morning!
I have to obtain the number of times that a complete line appears repeated in my dataframe, to then show only those lines that appear repeated and show in the last column how many times those lines appear repetitions.
This should be the result of the correct table:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
This is my current code:
def detect_duplicates(data):
x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
x = data[data.duplicated(keep=False)].drop_duplicates()
return x
I get the result correctly, however the last column "nums_reps" that should show me how many times that repeated column appears, I have something lost, since I do not know how to do that accounting.
This is the result that I am currently obtaining, it is very similar to the one I should have, however I do not know how to add the accounting of the repeated rows at the end.
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
In summary, how could you perform that calculation? How could I count the number of rows that have all the attributes equal, to finally show them in a final row "num_reps".
Thanks