How to iterate in a dataframe for a certain number of rows?

4

I have the following dataframe:

import pandas as pd
import numpy as np

data = pd.date_range('20180101', periods=300)
df = pd.DataFrame(np.random.randn(300, 5), index=data,
              columns=['open', 'high', 'low', 'close', 'volume'])

I want to iterate over the rows two at a time, that is, take the values of each pair of rows and do something with them. Is there a way to do this with pandas?

    
asked by fredmanre 27.12.2018 at 14:54

3 answers

1

Introduction

Iterating over a pandas dataframe is usually not the best solution. Most of the time, the reason you want to iterate is to perform some calculation on the contents of the rows. Pandas implements a large number of "vectorized" operations, which let you perform the calculation in a single line of code over many rows at once, without writing the loop yourself (pandas does the iteration internally, usually delegating to numpy and to native C code, which is much faster and more efficient than a Python loop).

For example, to obtain the sum of the "volume" column of the dataframe df, you might think of iterating through all the rows and accumulating the value of that column, something like this:

suma = 0
for v in df.volume:
    suma += v

But the following is much shorter and more efficient:

suma = df.volume.sum()

The difference is even bigger if you want to calculate the mean of every column. The solution with a loop would require maintaining an "accumulator" variable for each column, probably using df.iterrows(), and so on, while the pandas solution is simply df.mean().
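As a quick check of that comparison, here is a minimal sketch that computes the column means both ways on a dataframe built like the one in the question. The fixed seed and the names totals, loop_means and vector_means are only illustrative choices, not part of the question.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed, only so the run is repeatable
data = pd.date_range('20180101', periods=300)
df = pd.DataFrame(np.random.randn(300, 5), index=data,
                  columns=['open', 'high', 'low', 'close', 'volume'])

# Loop version: one accumulator per column, filled row by row
totals = {col: 0.0 for col in df.columns}
for _, row in df.iterrows():
    for col in df.columns:
        totals[col] += row[col]
loop_means = pd.Series({col: totals[col] / len(df) for col in df.columns})

# Vectorized version: a single call
vector_means = df.mean()

# Both approaches agree up to floating-point rounding
print(np.allclose(loop_means, vector_means))
```

The loop needs a dictionary, two nested loops and a final division, and it gets slower quickly as the dataframe grows, while the vectorized call stays a single line.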

If you want to act on only a subset of the rows instead of all of them, you can use df.loc or df.iloc and put the selection in square brackets: a range of index values for df.loc, or a range of row positions for df.iloc.

For example, since in your case the index is of type datetime, you can do something like the following to operate only on the data for January:

df.loc["2018-01"]

and then apply any operation to that selection, such as .sum(), .mean(), etc.

Or act only on the first 50 rows with:

df.iloc[0:50]
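To make the difference between the two selectors concrete, here is a small sketch on a dataframe built like the one in the question. The variable names january, jan_explicit and first_50 are made up for the example.

```python
import numpy as np
import pandas as pd

data = pd.date_range('20180101', periods=300)
df = pd.DataFrame(np.random.randn(300, 5), index=data,
                  columns=['open', 'high', 'low', 'close', 'volume'])

# Label-based: a partial date string matches every row in that month
january = df.loc["2018-01"]

# Label-based slice: both endpoints are included
jan_explicit = df.loc["2018-01-01":"2018-01-31"]

# Position-based slice: the stop position is excluded, as in plain Python
first_50 = df.iloc[0:50]

print(len(january), len(jan_explicit), len(first_50))  # 31 31 50
print(january["volume"].sum())  # any operation works on the selection
```

The point to remember is that label slices with .loc include the end label, whereas positional slices with .iloc exclude the stop position.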

In your case

You ask to take the values every two rows. .iloc[] accepts an ordinary Python slice, so you can use the syntax [start:stop:step] with any step you like. For example, the following selects the rows at even positions (0, 2, 4, ...):

df.iloc[::2]

and the following selects the rows at odd positions (1, 3, 5, ...):

df.iloc[1::2]

Example:

>>> print(df.iloc[::2].agg(["count", "mean"]))
             open        high         low       close      volume
count  150.000000  150.000000  150.000000  150.000000  150.000000
mean    -0.248744   -0.086711    0.021593    0.024451    0.157441
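If the goal is to combine the two rows of each consecutive pair rather than look at the even and odd rows separately, one possible approach (a sketch, not the only way) is to label each row with the number of its pair and group by that label, or to subtract the two step-2 selections by position. The names pair_id, pair_sums and pair_diffs are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.date_range('20180101', periods=300)
df = pd.DataFrame(np.random.randn(300, 5), index=data,
                  columns=['open', 'high', 'low', 'close', 'volume'])

# Label each row with the number of the pair it belongs to: 0, 0, 1, 1, 2, 2, ...
pair_id = np.arange(len(df)) // 2

# One result row per pair, here the sum of the two rows
pair_sums = df.groupby(pair_id).sum()

# Second row of each pair minus the first.
# to_numpy() drops the date index so rows are matched by position, not by date.
first = df.iloc[::2]
second = df.iloc[1::2]
pair_diffs = pd.DataFrame(second.to_numpy() - first.to_numpy(),
                          index=first.index, columns=df.columns)

print(pair_sums.shape, pair_diffs.shape)  # (150, 5) (150, 5)
```

The subtraction assumes an even number of rows, so that both selections have the same length. The groupby version also works with an odd number of rows, in which case the last group contains a single row.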
    
answered by 28.12.2018 / 12:46
0

What I think you can do is use a while or a for loop. For example:

data = pd.date_range('20180101', periods=300)
df = pd.DataFrame(np.random.randn(300, 5), index=data,
              columns=['open', 'high', 'low', 'close', 'volume'])

a = 0
while a < len(df):
    pair = df.iloc[a:a + 2]  # the current two rows
    # do whatever you have in mind with them here
    a += 2  # move on to the next pair

After each pass, execution automatically returns to the top of the while.

    
answered by 27.12.2018 at 15:22
0

The most basic way is to iterate over a numeric index in steps of 2 and take slices of the dataframe at that index:

for i in range(0, len(df), 2):
  row1 = df[i:i+1]
  row2 = df[i+1:i+2]

In row1 and row2 you get a DataFrame with a single row, corresponding to the first and second row of each pair. A simpler way, which gives you a Series instead, is to use iloc[]:

for i in range(0, len(df), 2):
  row1 = df.iloc[i]
  row2 = df.iloc[i+1]

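Obviously, these solutions assume an even number of rows. With an odd number, on the last iteration the slice version leaves row2 empty and df.iloc[i+1] raises an IndexError.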
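If the number of rows may be odd, one way to avoid that assumption is to slice two rows at a time and check how many rows actually came back. This is only a sketch: df_odd is a hypothetical five-row dataframe, and chunk_sizes exists just to show what each iteration received.

```python
import numpy as np
import pandas as pd

# Hypothetical small dataframe with an odd number of rows (5)
df_odd = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])

chunk_sizes = []
for i in range(0, len(df_odd), 2):
    pair = df_odd.iloc[i:i + 2]  # slicing never raises; the last slice may be short
    chunk_sizes.append(len(pair))
    if len(pair) < 2:
        # leftover single row: handle it separately or skip it
        continue
    row1, row2 = pair.iloc[0], pair.iloc[1]
    # do something with row1 and row2 here

print(chunk_sizes)  # [2, 2, 1]
```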

    
answered by 27.12.2018 at 15:51