In Python Pandas how can I filter a dataFrame with part of the content of a string?

2

I have a dataFrame that has a column named 'linea' . I want to filter the dataFrame only with the lines that have the word "GRANEL" in its content and discard all other lines. To filter currently I use the code:

lista_produccion_terminado_granel = lista_produccion_terminado[(lista_produccion_terminado['linea']=='P3 EMB. A GRANEL')]

But this only serves me if the entire chain is the same. If there is a cell that has the value "P1 EMB.A GRANEL" it no longer works.

Can you suggest how to filter the dataFrame using part of the string in the condition?

    
asked by Alejandro Gomez 19.01.2018 в 20:37
source

1 answer

2

Use pandas.Series.str.contains to create the Boolean mask with only the rows that have the substring in that column. An example:

>>> import pandas as pd

>>> data = {"col1": ("abcd efg", "granel", "ab granel cd", "afgdv", "ab GRANEL", "12458"),
            "col2": (45,45,89,23,55,77)}

>>> df = pd.DataFrame(data)
>>> filtro = df[df['col1'].str.contains("GRANEL", case=False)]
>>> filtro

           col1  col2
1        granel    45
2  ab granel cd    89
4     ab GRANEL    55

The case argument indicates whether or not to differentiate between uppercase and lowercase. If you only want to validate "BULK" and not "bulk", "Bulk", etc. it is enough to let it give you a value of True .

Note that the operator in of Python is used, so "abcgraneldfg" would also be validated. If this possibility could exist and you only want to validate the individual word, you should use a regular expression:

>>> import pandas as pd

>>> data = {"col1": ("¿GRANEL?", "GRANEL ABC", "GRANEL", "GRANELCD", "A GRANEL.", "12458"),
            "col2": (45,45,89,23,55,77)}

>>> df = pd.DataFrame(data)
>>> filtro = df[df['col1'].str.contains(r"\bGRANEL\b", regex=True)]
>>> filtro
         col1  col2
0    ¿GRANEL?    45
1  GRANEL ABC    45
2      GRANEL    89
4   A GRANEL.    55
    
answered by 19.01.2018 / 21:01
source