GroupBy Data Validation

Question

GroupBy Data Validation

Navigation

#1 by (1 votes)

2

It is necessary to group the items by name

Then we must see that all those with the same names also coincide in RREF1, RREF2, RREF3 and if they are not the same in the RREF that says which.

# Agrupacion por valores similares en columna, resta de valores entre dos columnas

cond1_df = (
    mydataset_df
    .groupby(["name"], as_index=False).sum()
    .assign(
        validacion=lambda row: row.RREF3 == row.RREF3
    )


)
cond1_df.head(10)

something that should come out is:

but in validation tell me which values are not equal

python pandas

asked by Yan Chirino 09.04.2018 в 20:39

source

1 answer

Operate with powers of decimals in shell How to put a container-fluid that occupies 100% of the pag

score 1 · Accepted Answer

We will start from a DataFrame first as you show to be able to reproduce the problem:

import pandas as pd


data = {"name": ("Babero", "Babero Terry", "Babero Terry", "Baly", "Baly", "Baly", "Barilla Metalica", "Base Para Portátil", "Base Para Portátil", "Base Para Portátil"),
        "ref": (8090, 6089, 8089, 3045, 3046, 3047, 141, 7188, 7190, 7191),
        "RREF1": (1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
        "RREF2": (7, 6, 7, 7, 7, 0, 6, 7, 7, 7),
        "RREF3": (18, 1, 18, 15, 1, 0, 9, 18, 18, 18)
        }

mydataset_df = pd.DataFrame(data)

One option is to use transform next to pandas.DataFrame.nunique to get what you want:

mydataset_df["Válido"] = (mydataset_df.groupby(["name"], as_index=False)
                                      ["RREF1", "RREF2", "RREF3"]
                                      .transform("nunique", dropna=False)
                                      .sum(axis=1) == 3
                         )

The nunique method returns the number of unique values for each RREF column within each group. The rows of the groups that have all their RREF columns equal will have 1 as a value in all the columns returned by nunique , so the sum of their columns will be 3.

The previous code gives us the following result:

>>> mydataset_df

                 name   ref  RREF1  RREF2  RREF3  Válido
0              Babero  8090      1      7     18    True
1        Babero Terry  6089      2      6      1   False
2        Babero Terry  8089      1      7     18   False
3                Baly  3045      1      7     15   False
4                Baly  3046      1      7      1   False
5                Baly  3047      1      0      0   False
6    Barilla Metalica   141      1      6      9    True
7  Base Para Portátil  7188      1      7     18    True
8  Base Para Portátil  7190      1      7     18    True
9  Base Para Portátil  7191      1      7     18    True

Both Terry Babe and Baly are not validated since their RREF values are not equal in all rows.