First, calculating quartiles is not as simple as applying a coefficient to the maximum.
For example, the second quartile is not half the maximum, but the value at or below which half of the observations fall (also called the median). To find the quartiles it is therefore necessary to examine the entire data set, not only the maximum. Luckily, pandas has a method for that: quantile(), available on both DataFrame and Series.
For example, I'm going to create a DataFrame with a few numbers to illustrate it:
import pandas as pd
import numpy as np
data = {"n": np.random.randint(1,100, 100)}
df = pd.DataFrame(data)
print(df.head())
n
0 54
1 20
2 6
3 8
4 55
They are random integers from 1 to 99 (the upper bound of np.random.randint is exclusive). With df.n.max() I see that the maximum of this sample is 99. To obtain the quartiles:
print(df.n.quantile([.25, .5, .75]))
0.25 30.00
0.50 54.00
0.75 80.25
Name: n, dtype: float64
As you can see, the second quartile (0.50) is 54.00, not 49.5 as you would get by taking half the maximum.
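To make the difference concrete, here is a minimal sketch with a few hand-picked values (hypothetical, not the random data above) where one large observation pulls the maximum far away from the median:

```python
import pandas as pd

# Hypothetical, hand-picked values: one outlier dominates the maximum
s = pd.Series([1, 2, 3, 4, 100])

half_of_max = s.max() / 2    # depends only on the largest value
median = s.quantile(0.5)     # depends on how all the values are distributed

print(half_of_max)  # 50.0
print(median)       # 3.0
```

Half of the maximum says nothing about where the bulk of the data lies, which is why the whole column has to be examined.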
Note also that quantile() returns another pandas Series (indexed by the requested fractions), of which we only need the values (30, 54, 80.25). We can capture them in a NumPy array as follows:
qs = df.n.quantile([.25, .50, .75]).values
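If you want to check what quantile() hands back, the following sketch inspects the result on a small, hypothetical column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], name="n")  # hypothetical sample values
q = s.quantile([.25, .5, .75])

print(type(q).__name__)   # Series
print(list(q.index))      # [0.25, 0.5, 0.75]

qs = q.values             # the underlying NumPy array
print(type(qs).__name__)  # ndarray
print(qs.tolist())        # [20.0, 30.0, 40.0]
```

The index of the result holds the requested fractions, so q[0.5] would also work if you prefer to look quartiles up by name rather than by position.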
And now we use this array to assign the extra columns that you were looking for:
df["Q1"] = (df.n <= qs[0]) + 0
df["Q2"] = ((df.n <= qs[1]) & (df.n > qs[0])) + 0
df["Q3"] = ((df.n <= qs[2]) & (df.n > qs[1])) + 0
df["Q4"] = (df.n > qs[2]) + 0
What does this mean? Take the first assignment:
- df["Q1"] = creates a new column called "Q1" and assigns to it the result of the expression to the right of the =, which is expected to be a pandas Series (one column).
- (df.n <= qs[0]) is a vectorized operation that compares each element of the column df.n with the value of the first quartile (30, as we saw). Each comparison yields True or False, so the operation gives us a boolean column.
- + 0 is another vectorized operation, which adds 0 to each element of the boolean column. This forces True to be treated as 1 and False as 0, producing the column of ones and zeros that you wanted.
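The comparison and the + 0 step can be seen in isolation in this small sketch. The three values and the threshold of 30 are hypothetical, and astype(int) is shown only as an equivalent, more explicit spelling of the same conversion:

```python
import pandas as pd

s = pd.Series([20, 30, 54])    # hypothetical values
mask = s <= 30                 # element-wise comparison, boolean Series

as_int = mask + 0              # True -> 1, False -> 0
explicit = mask.astype(int)    # same conversion, spelled out

print(mask.tolist())      # [True, True, False]
print(as_int.tolist())    # [1, 1, 0]
print(explicit.tolist())  # [1, 1, 0]
```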
All other assignments are similar and easy to interpret.
The result would be this:
print(df.head())
n Q1 Q2 Q3 Q4
0 54 0 1 0 0
1 20 1 0 0 0
2 6 1 0 0 0
3 8 1 0 0 0
4 55 0 0 1 0
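As a possible alternative, the same indicator columns can probably be built with pd.cut and pd.get_dummies instead of four hand-written comparisons. This is only a sketch under some assumptions: the sample values are hypothetical, and the three quartiles must be distinct (pd.cut rejects duplicate bin edges unless told otherwise). It also recomputes the manual columns to confirm that both approaches agree:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen so the three quartiles are distinct
df = pd.DataFrame({"n": [6, 8, 20, 54, 55, 99]})
qs = df.n.quantile([.25, .50, .75]).values

# Manual approach, as in the assignments above
manual = pd.DataFrame({
    "Q1": (df.n <= qs[0]) + 0,
    "Q2": ((df.n <= qs[1]) & (df.n > qs[0])) + 0,
    "Q3": ((df.n <= qs[2]) & (df.n > qs[1])) + 0,
    "Q4": (df.n > qs[2]) + 0,
})

# Alternative: bin into right-closed intervals, then one-hot encode.
# Intervals: (-inf, q1], (q1, q2], (q2, q3], (q3, inf)
edges = [-np.inf, *qs, np.inf]
labels = ["Q1", "Q2", "Q3", "Q4"]
binned = pd.cut(df.n, bins=edges, labels=labels)
dummies = pd.get_dummies(binned).astype(int)
dummies.columns = dummies.columns.astype(str)  # plain string column labels

# Both approaches should produce identical indicator columns
print((dummies.to_numpy() == manual.to_numpy()).all())

result = df.join(dummies)
print(result)
```

Because pd.cut uses right-closed intervals by default, a value exactly equal to a quartile lands in the lower bin, matching the <= comparisons used above.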