Select intervals of a column in a DataFrame

1

I have a DataFrame made with pandas, formed by the columns ['x','y','z0','z','arbol'] , attached screen of its structure:

From the data of that general DataFrame I have extracted all those whose tree value is equal to 1 creating another DataFrame called arbol1 . Inside the column z0 of the latter, I want to select the values that are between 0.511 and 0.530 to extract them to a .csv file.

I have managed to select the data smaller than 0.530, but I am not able to establish an interval, I only know how to sort my data, select a specific value (for example z0 = 0597) or the values that are below or above a value.

import pandas as pd

df = pd.read_table("0Arbolesavila4.txt",header=0,names=['x','y','z0','z','arbol'])

arbol1 = df[df.arbol == 1]

print(arbol1)
print(arbol1['z0'])
print(arbol1.sort_values(by='z0'))
print(arbol1.loc[arbol1['z0'] == 0.597])
print(arbol1[arbol1.ix[:,2] > 0.511])
    
asked by Deva 13.07.2018 в 11:03
source

1 answer

0

You have several possibilities, for example:

  • Use a Boolean filter similar to what you do, but using the operator & (and):

    res =  arbol1[(arbol1['z0'] > 0.511) & (arbol1['z0'] < 0.530)]
    

    At run time it is very efficient, it is a simple Boolean mask of NumPy, the problem is that each subexpression separated by the operator is first evaluated separately and requires a temporary boolean array each, so it is aggressive with the use of RAM, especially in very large data sets.

  • Use pandas.Series.between :

    res = arbol1[arbol1['z0'].between(0.511, 0.530, inclusive=False)]
    

    For this particular case it is possibly the most efficient generically of the three and it is not lost in readability.

  • Use pandas.DataFrame.query :

    res = arbol1.query('0.511 < z0 < 0.530')
    

    It is generally somewhat slower than the two previous options, but it is considerably more flexible and in many cases much more readable. That is, for very large data sets, it does not require the temporary arrays that the first option requires (use the numexpr package), so it is more efficient in terms of memory usage compared to the Boolean mask and & .

  

Note: It is not necessary to explicitly order the DataFrame for that column previously in any of the cases.

A small example based on your data:

>>> import io
>>> import pandas as pd

>>> table = io.StringIO('''
...     x        y       z0
... 22.99   73.089    0.529
... 21.78   40.149    0.512
... 14.45   42.215    0.540
... 12.12   42.144    0.503
... 32.55   51.423    0.521
... 52.21   45.412    0.509
... ''')

>>> arbol1 = pd.read_table(table, sep="\s+")
>>> arbol1
       x       y     z0
0  22.99  73.089  0.529
1  21.78  40.149  0.512
2  14.45  42.215  0.540
3  12.12  42.144  0.503
4  32.55  51.423  0.521
5  52.21  45.412  0.509

>>> arbol1[(arbol1['z0'] > 0.511) & (arbol1['z0'] < 0.530)] 
       x       y     z0
0  22.99  73.089  0.529
1  21.78  40.149  0.512
4  32.55  51.423  0.521

>>> arbol1[arbol1['z0'].between(0.511, 0.530, inclusive=False)]
       x       y     z0
0  22.99  73.089  0.529
1  21.78  40.149  0.512
4  32.55  51.423  0.521

>>> arbol1.query('0.511 < z0 < 0.530')
       x       y     z0
0  22.99  73.089  0.529
1  21.78  40.149  0.512
4  32.55  51.423  0.521
    
answered by 13.07.2018 / 11:41
source