The conditions that you can put in brackets are not "normal" expressions of python. Python expressions operate between scalar data (not between series or vectors). Some comparison operators are redefined (overloaded) to work with pandas series, like the operator
== , and that's why the following works:
df['a'] results in a
Series panda with the values in column "a". In principle we can not compare an entire column with the number 2, which is what we seem to be doing. What happens is that the type
Series has redefined its comparison operator so that when you compare it with a scalar (the 2), it actually does a "vector" operation, comparing all the elements of the column with 2 and returning a
Series of Booleans (with
True in cases where it was true).
To be able to implement that trick, Pandas defines the function
Series.__eq__() , which is what Python will call when it finds a comparison
== between a
Series and something else.
Not all Python operators can be overloaded in the same way. The
in operator that you try to use, for example, can not. There is no "magic" method (equivalent to the
__eq__() of before) for the operator
So Pandas is forced to use another syntax here. The method that defines to make such comparisons is
Series.isin(conjunto) , but this
method must be invoked explicitly (ex:
df[df['a'].isin([2,3]) ) and the syntax
df[df['a'] in [2,3]) does not work.
So your example is solved like this:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(a=[2,5,7], b=[3,6,8], c=[8,5,7]))
a b c
0 2 3 8
1 5 6 5
2 7 8 7
a b c
0 2 3 8
1 5 6 5
Since this topic has come out, and only for completeness, I will mention the reason for the strange syntax of the Boolean expressions inside the brackets when selecting rows in Pandas.
If we want to filter the rows in which the column "a" is worth 2 and the column "b" is worth 3, we might be tempted to write:
>>> df[df['a']==2 and df['b']==3]
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why has not it worked? As I explained before
df['a']==2 is a vectorized operation that is actually executing
df['a'].__eq__(2) and returns a
Series that in this case would have the values
False , indicating which rows meet the condition . Similarly
df['b'].__eq__(3) which results in the series
False , and now we want to make the
and Boolean between both series.
Unfortunately the operator
and only works when we have Boolean values on its left and right (or can be interpreted as Boolean). But what we have in this case to the left and right of
and are series. What is the Boolean value of a string such as
False ? Unlike a python list that is considered
False if it is empty or
True if not, the series Pandas do not have a truth value defined. Therefore, they can not be used as part of boolean expressions. In fact, the error is telling us just that: The truth value of a Series is ambiguous .
On the other hand, this other surprising syntax that works :
>>> df[(df['a']==2) & (df['b']==3)]
a b c
0 2 3 8
I had to change
& and put in parentheses each of the conditions
== why now it works?
It's a trick pandas. The operator
and (as it happened with
in ) can not be overloaded to give it another meaning (as we can do with
== that I can reimplement in the
__eq__() method.) But the operator
& can be overloaded , implementing the
Normally the operator
& performs the AND operation "bit by bit" between its two operands, but Pandas has overloaded it so that it performs the Boolean operation
and element by element between two series.
That is, the expression:
(df['a']==2) & (df['b']==3)
is evaluated by python as:
that does the "magic" we wanted.
Analogously '|' is implemented in
~ is implemented in
Series.__not__ , and along with that
< is implemented in
Series.__lt__() , etc. this allows us to write things like:
df[(df['a']>2) | ~(df['b']>3)]