Filter data when loading CSV using Pandas in Python 3.x

Question

Filter data when loading CSV using Pandas in Python 3.x

Navigation

#1 by (3 votes)

4

I am uploading very large CSV files where there is information that I do not use. What I do now is go to lists and then fit with conditions.

For example, in a column I have "A" and "B" but I only want those rows that have "A", so I filter the "B" and I do it for all the other lists I load from Pandas. I think what I'm doing is very inefficient, of loading Pandas and then converting the columns to lists to filter them.

There is some more efficient way to filter the data. I handle the data in this way in Pandas:

   import pandas as pd
   df = pd.read_csv('C:\Users\4209456\Downloads\TECNICO\BBDD FD.csv', header=0, sep=';',parse_dates = ['Date'],dayfirst = True)

and then I create lists of some columns:

   FD=df["FD Text"]
   Fecha=pd.to_datetime(df["Date"],dayfirst = True)

From there for both filter lists according to a condition of the list "FD", creating another list "FD2", within the condition I have the list "Date" creating a corrected list "Date2". In this way I have the lists that I need to start my code, which are "FD2" and "Date2", keeping the original positions of the data that remained of the filtration.

python

asked by Jorge Ponti 22.08.2017 в 12:56

source

1 answer

segmentation fault Enlarge 100% box using display: flex

score 3 · Accepted Answer

There are several ways to filter rows in Pandas, the simplest thing is to create a mask (Boolean array) using a conditional on the column in question and then filter the rows of the Dataframe with it. Basically it is the procedure that is followed in NumPy to filter arrays. This is known as Boolean indexing .

We can create a small example to see it:

import pandas as pd
from io import StringIO

data = StringIO('''
Nombre,Edad,Sexo,Fecha
Juan,12,M,20/08/2017
Laura,21,F,24/07/2017
Pedro,53,M,13/01/2017
María,17,F,15/03/2017
Luís,19,M,15/07/2017
Miguel,23,M,14/08/2017
''')

df = pd.read_csv(data, header=0, parse_dates = ['Fecha'], dayfirst = True)

We therefore have the following Dataframe:

>>> df

   Nombre  Edad Sexo      Fecha
0    Juan    12    M 2017-08-20
1   Laura    21    F 2017-07-24
2   Pedro    53    M 2017-01-13
3   María    17    F 2017-03-15
4    Luís    19    M 2017-07-15
5  Miguel    23    M 2017-08-14

We can filter using the sexo column to obtain another Dataframe that only contains the rows corresponding to women simply by doing:

>>> df2 = df[df['Sexo'] == 'F']
>>> df

  Nombre  Edad Sexo      Fecha
1  Laura    21    F 2017-07-24
3  María    17    F 2017-03-15

With df['Sexo'] == 'F' you simply create a mask that in this case is a series of Pandas and that contains a single Boolean column: [False, True, False, True, False, False] result of comparing if each value of the column equals 'F' . We can filter with any other iterable boolean values, for example an array of NumPy, a list, a column of another DataFrame, etc.

Another example, filtering those who are 18 or older:

>>> df2 = df[df['Edad'] >= 18]

>>> df2

   Nombre  Edad Sexo      Fecha
1   Laura    21    F 2017-07-24
2   Pedro    53    M 2017-01-13
4    Luís    19    M 2017-07-15
5  Miguel    23    M 2017-08-14

Edit:

You can filter by dates following the same idea, for example we can filter the row with Fecha between the current date and 30 days ago:

>>> fecha_limite = pd.datetime.now().date() - pd.Timedelta(days=30)
>>> df2 = df[(df['Fecha'] > fecha_limite)]
>>> df2

   Nombre  Edad Sexo      Fecha
0    Juan    12    M 2017-08-20
1   Laura    21    F 2017-07-24
5  Miguel    23    M 2017-08-14

You can also use several conditions, for example the previous condition but also be a woman:

>>> fecha_limite = pd.datetime.now().date() - pd.Timedelta(days=30)
>>> df2 = df[(df['Fecha'] > fecha_limite) & (df['Sexo'] == 'F')]
>>> df2

  Nombre  Edad Sexo      Fecha
1  Laura    21    F 2017-07-24

You can filter using the index in the same way or using loc , for example, if our column Fecha were in the DataFrame index (DateTimeIndex) we can do:

>>> fecha_limite = pd.datetime.now().date() - pd.Timedelta(days=30)
>>> df2 = df[df.index.date > fecha_limite]