Filter a DataFrame using a list

2

If I have the following dataframe

  a,b,c
0 2,3,8
1 5,6,5
2 7,8,7

I would like to filter the rows, but using a previously created list. Use something like this:

num=[2,5]
df=df[df['a'] in num]

I would like to obtain the following result:

  a,b,c
0 2,3,8
1 5,6,5

But I get the following error:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), 
a.any() or a.all().

Obviously, I want to do this with a dataframe with more complex data as a string and I want to filter the rows of the dataframe by the elements of a column according to whether they belong to a list obtained by a previous process.

Greetings and thank you in advance.

    
asked by Juan Ignacio 24.04.2018 в 15:35
source

1 answer

2

The conditions that you can put in brackets are not "normal" expressions of python. Python expressions operate between scalar data (not between series or vectors). Some comparison operators are redefined (overloaded) to work with pandas series, like the operator == , and that's why the following works:

>>> df[df['a']==2]

df['a'] results in a Series panda with the values in column "a". In principle we can not compare an entire column with the number 2, which is what we seem to be doing. What happens is that the type Series has redefined its comparison operator so that when you compare it with a scalar (the 2), it actually does a "vector" operation, comparing all the elements of the column with 2 and returning a Series of Booleans (with True in cases where it was true).

To be able to implement that trick, Pandas defines the function Series.__eq__() , which is what Python will call when it finds a comparison == between a Series and something else.

Not all Python operators can be overloaded in the same way. The in operator that you try to use, for example, can not. There is no "magic" method (equivalent to the __eq__() of before) for the operator in .

So Pandas is forced to use another syntax here. The method that defines to make such comparisons is Series.isin(conjunto) , but this method must be invoked explicitly (ex: df[df['a'].isin([2,3]) ) and the syntax df[df['a'] in [2,3]) does not work.

So your example is solved like this:

>>> import pandas as pd
>>> df = pd.DataFrame(dict(a=[2,5,7], b=[3,6,8], c=[8,5,7]))
>>> print(df)
   a  b  c
0  2  3  8
1  5  6  5
2  7  8  7

>>> num=[2,5]
>>> df=df[df['a'].isin(num)]
>>> print(df)
   a  b  c
0  2  3  8
1  5  6  5

Enlargement

Since this topic has come out, and only for completeness, I will mention the reason for the strange syntax of the Boolean expressions inside the brackets when selecting rows in Pandas.

If we want to filter the rows in which the column "a" is worth 2 and the column "b" is worth 3, we might be tempted to write:

>>> df[df['a']==2 and df['b']==3]
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Why has not it worked? As I explained before df['a']==2 is a vectorized operation that is actually executing df['a'].__eq__(2) and returns a Series that in this case would have the values True , False , False , indicating which rows meet the condition . Similarly df['b']==3 executes df['b'].__eq__(3) which results in the series True , False , False , and now we want to make the and Boolean between both series.

Unfortunately the operator and only works when we have Boolean values on its left and right (or can be interpreted as Boolean). But what we have in this case to the left and right of and are series. What is the Boolean value of a string such as True , False , False ? Unlike a python list that is considered False if it is empty or True if not, the series Pandas do not have a truth value defined. Therefore, they can not be used as part of boolean expressions. In fact, the error is telling us just that: The truth value of a Series is ambiguous .

On the other hand, this other surprising syntax that works :

>>> df[(df['a']==2) & (df['b']==3)]
   a  b  c
0  2  3  8

I had to change and for & and put in parentheses each of the conditions == why now it works?

It's a trick pandas. The operator and (as it happened with in ) can not be overloaded to give it another meaning (as we can do with == that I can reimplement in the __eq__() method.) But the operator & can be overloaded , implementing the Series.__and__() method.

Normally the operator & performs the AND operation "bit by bit" between its two operands, but Pandas has overloaded it so that it performs the Boolean operation and element by element between two series.

That is, the expression:

(df['a']==2) & (df['b']==3)

is evaluated by python as:

(df['a'].__eq__(2)).__and__(df['b'].__eq__(3))

that does the "magic" we wanted.

Analogously '|' is implemented in Series.__or__() and ~ is implemented in Series.__not__ , and along with that < is implemented in Series.__lt__() , etc. this allows us to write things like:

df[(df['a']>2) | ~(df['b']>3)]
    
answered by 24.04.2018 / 17:36
source