Determine which quartile each value of a column falls in

I have a column called 'Col6' that contains numerical values, and for each sample I want to know which quartile it falls in. I want to represent this in binary form, with 4 columns (one per quartile).

My idea was to create 4 columns, one per quartile, and check each value. I did this:

# Create the columns and set everything to 0
df["Col6_Q1"] = 0
df["Col6_Q2"] = 0
df["Col6_Q3"] = 0
df["Col6_Q4"] = 0

# Compute the thresholds
max = df["Col6"].max()
mitad = df["Col6"].max()/2
un_cuarto = df["Col6"].max()/4
tres_cuartos = df["Col6"].max()*3/4

# Check which quartile each value falls in and set its column to 1
if (df['Col6'] <= un_cuarto):
    df = df.assign(Col6_Q1 = 1)

I only show the first quartile because I have already verified that this method does not work.

I think the way to do it is to first create a column that categorizes the values into quartiles and then use get_dummies() to create the 4 binary columns, but I do not know how to do it.

Regards and thanks!

    
asked by NEA 14.12.2018 at 13:51

1 answer

First, computing quartiles is not as simple as applying a coefficient to the maximum.

For example, the second quartile is not half the maximum, but the value at or below which half of the observations lie (also called the median). Finding the quartile values therefore requires examining the whole data set, not only its maximum. Luckily, pandas has a method for that: Series.quantile() (also available as DataFrame.quantile()).
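
To make the difference concrete, here is a minimal sketch with a small made-up series (the values are illustrative and not taken from the question's data) in which a single large value pulls the maximum far away from most of the observations:

```python
import pandas as pd

# Hypothetical, deliberately skewed data: one large outlier.
s = pd.Series([1, 2, 3, 4, 100])

# Half of the maximum depends only on the largest value.
half_of_max = s.max() / 2

# The median depends on how all the observations are distributed.
median = s.quantile(0.5)

print(half_of_max, median)
```

Here half the maximum is 50, while the median is 3, so a threshold based on the maximum would put four of the five values in the lowest group.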

For example, I'm going to create a DataFrame with a few numbers to illustrate it:

import pandas as pd
import numpy as np

data = {"n": np.random.randint(1,100, 100)}
df = pd.DataFrame(data)
print(df.head())
    n
0  54
1  20
2   6
3   8
4  55

These are random integers between 1 and 99 (the upper bound of np.random.randint is exclusive). With df.n.max() I see that the maximum is 99. To obtain the quartiles:

print(df.n.quantile([.25, .5, .75]))
0.25    30.00
0.50    54.00
0.75    80.25
Name: n, dtype: float64

You can see that the second quartile (0.50) is 54.00, not 49.5 as you would get by taking half the maximum.

Note also that quantile() returns a pandas Series, of which we only need the values (30, 54, 80.25). We can capture them in a NumPy array as follows:

qs = df.n.quantile([.25, .50, .75]).values

And now we use this array to assign the extra columns that you were looking for:

df["Q1"] = (df.n <= qs[0]) + 0
df["Q2"] = ((df.n <= qs[1]) & (df.n > qs[0])) + 0
df["Q3"] = ((df.n <= qs[2]) & (df.n > qs[1])) + 0
df["Q4"] = (df.n > qs[2]) + 0

What does this mean? Take the first line as an example:

  • df["Q1"] = creates a new column called "Q1" and assigns to it the result of the expression to the right of = , which is expected to be a pandas Series (one column).
  • (df.n <= qs[0]) is a vectorized operation that compares each element of the column df.n with the first quartile (30, as we saw). Each comparison yields True or False , so the result is a boolean column.
  • + 0 is another vectorized operation that adds 0 to each element of that boolean column. This forces True to be treated as 1 and False as 0 , producing the column of ones and zeros that you wanted.

The other assignments work the same way, each with the boundaries of its own quartile.
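
The boolean-to-integer step can be tried in isolation. The following is a small sketch on made-up values, with 30 standing in for the first quartile boundary; it also shows .astype(int), an explicit cast that is not used in the code above but should give the same values as + 0:

```python
import pandas as pd

# Hypothetical values; 30 stands in for the first quartile boundary.
s = pd.Series([6, 20, 30, 54])

# Element-wise comparison: produces a boolean Series.
mask = s <= 30

# Adding 0 turns True/False into 1/0.
as_int_add = mask + 0

# Explicit cast to integers (an alternative to "+ 0").
as_int_cast = mask.astype(int)

print(as_int_add.tolist())
print(as_int_cast.tolist())
```

Both forms yield 1 for the values at or below the boundary and 0 otherwise; which one to use is mostly a matter of readability.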

The result would be this:

print(df.head())
    n  Q1  Q2  Q3  Q4
0  54   0   1   0   0
1  20   1   0   0   0
2   6   1   0   0   0
3   8   1   0   0   0
4  55   0   0   1   0
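
The route suggested in the question (categorize first, then get_dummies()) is also possible. The following is only a sketch of that idea, not the method used above: it assumes pd.qcut for the categorization, uses made-up seeded data, and the column name "quartile" and the labels "Q1" to "Q4" are names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so that the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"n": rng.integers(1, 100, size=100)})

# pd.qcut bins values by quantiles; q=4 produces one category per quartile.
df["quartile"] = pd.qcut(df["n"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# One indicator column per category, cast to 0/1 integers.
dummies = pd.get_dummies(df["quartile"]).astype(int)
df = df.join(dummies)

print(df.head())
```

Note that pd.qcut raises an error when two quantile boundaries coincide (for example, with many repeated values), so the explicit comparisons shown earlier may be more robust for such data.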
    
answered by 14.12.2018 / 14:21