How to work with datetime

My problem is that I don't know how to work with datetime values faster. When I use a for loop, it takes almost a day to run.

import pandas as pd

df2 = pd.DataFrame({'fecha1': ["2018-3-3", "2018-4-15"],
                    'fecha2': ["2018-02-31", "2018-02-27"]})

df2["nueva"] = 0
for i in range(len(df2)):
    if df2.fecha1.iloc[i] == "2018-3-3":
        df2["nueva"].iloc[i] = df2.fecha1.iloc[i+1] - df2.fecha1.iloc[i]
    else:
        df2["nueva"].iloc[i] = 0

I don't know whether I'm misusing the for loop, or whether subtracting datetimes that carry the 00:00:00 time component is what takes so long, but on a DataFrame of one hundred thousand rows it takes a day.

    
asked by Emanuel Lemos 04.01.2019 at 02:12

1 answer

It is not at all clear what you intend with the example because, apart from the syntax errors pointed out in my comment, the logic is also confused. It also contains invalid dates, such as February 31. In short, a disaster :-)

Apparently, you want to calculate the difference between two successive dates in the fecha1 column, but only when the date is "2018-3-3" (otherwise the result would be zero), and assign that result to a new column.

The code will fail if the date "2018-3-3" appears in the last row, as there is no "next" date to subtract from. On top of that, the fecha2 column is never used, which also seems odd.

You say in the comment that "it was only an example", from which I infer that this is not the real problem you intend to solve. But that real problem must exist, since you mention that it takes "almost a day" to execute. It would be much more productive for everyone if you asked about your real problem, instead of an artificial example whose solution may have nothing to do with it.

In any case, as a general rule:

  • Dates must be of type datetime if you want to operate on them (for example, to calculate their differences), and not strings like the ones you are using, which cannot be subtracted.
  • If you can avoid it, never iterate over a DataFrame. All the power of pandas comes from the fact that it has its own, much faster iterators. If in the end you are going to write your own loops, it makes no sense to use pandas and you might as well keep everything in a standard Python list. Pandas can perform "vectorized" operations between columns, such as df.a + df.b, which adds each element of column a to the corresponding element of column b without you having to write any loop.
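As a minimal sketch of both rules, the snippet below converts the string columns from the question to datetime and subtracts them in one vectorized step. The column name "diff" is made up for this illustration, and errors="coerce" is my own choice for dealing with the invalid February 31 (it becomes NaT, pandas' missing-date marker, instead of raising):

```python
import pandas as pd

# Same strings as in the question, including the invalid "2018-02-31".
df = pd.DataFrame({"fecha1": ["2018-3-3", "2018-4-15"],
                   "fecha2": ["2018-02-31", "2018-02-27"]})

# errors="coerce" turns dates that cannot be parsed into NaT instead of raising.
df["fecha1"] = pd.to_datetime(df["fecha1"], errors="coerce")
df["fecha2"] = pd.to_datetime(df["fecha2"], errors="coerce")

# Vectorized subtraction of two datetime columns, with no explicit loop.
# "diff" is a hypothetical column name used only in this sketch.
df["diff"] = df["fecha1"] - df["fecha2"]
print(df)
```

The first row ends up with a missing difference because its fecha2 could not be parsed. Depending on the pandas version, the conversion of fecha2 may also emit a warning about not being able to infer the date format.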

That said, and although this example is completely artificial, I'll try to replicate your question, but without the logical errors.

First, I create a DataFrame with a single column (fecha1), since fecha2 is unused in the example. This column holds 100 randomly generated entries. For this I write a function that returns dates in 2018, with the month between 1 and 4 and the day between 1 and 10. The function generates strings, but pd.to_datetime() converts them to datetime so that I can operate on them later:

import pandas as pd
import random

random.seed(2)   # So that the result is reproducible

def fecha_aleatoria():
  "Random date in the first four months of 2018, with the day between 1 and 10"
  mes = random.randint(1,4)
  dia = random.randint(1,10)
  return "2018-%02d-%02d" % (mes, dia)

fechas = pd.to_datetime([fecha_aleatoria() for i in range(100)])

df2 = pd.DataFrame({'fecha1': fechas})

This is what the DataFrame looks like (first five rows only):

      fecha1
0 2018-01-02
1 2018-01-06
2 2018-02-05
3 2018-03-10
4 2018-02-10

We will now do what you suggested in the example, that is, create a new column called "nueva" that holds the next date minus the current one, provided that the current one is 2018-03-03 (in this example that date only appears twice, so the result will be zero in almost all cases). But we will do it without loops.

The simplest approach (and a faster one than using loops) is to calculate the difference between each row and the next, regardless of the date's value. Pandas can do this in vectorized form in very little time:

df2["nueva"] = df2.fecha1.shift(-1) - df2.fecha1

The shift() method shifts the entire series (in this case one row up), and then the original series df2.fecha1 is subtracted from it. The result looks like this:

      fecha1    nueva
0 2018-01-02   4 days
1 2018-01-06  30 days
2 2018-02-05  33 days
3 2018-03-10 -28 days
4 2018-02-10 -31 days

As you can see, the "nueva" column holds the time difference between each date and the next one.
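To make the effect of shift(-1) easier to see, here is a small sketch using only the first three dates from the table above. The variable names s, siguiente and delta are invented for this illustration:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2018-01-02", "2018-01-06", "2018-02-05"]))

# shift(-1) moves every value one row up, so row i now holds the date of row i+1.
# The last row has no successor and is filled with NaT.
siguiente = s.shift(-1)

# Element-wise subtraction: next date minus current date.
delta = siguiente - s
print(delta)
```

Note that the last row comes out as NaT instead of raising an error, which is what would happen with iloc[i+1] in a loop when the index runs past the end.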

Now we will fill with zero all the unwanted cases, which are those where fecha1 is different from 2018-03-03, again without loops.

df2.loc[df2.fecha1 != pd.to_datetime("2018-03-03"), "nueva"] = 0

With loc[rows, columns] we can select certain cells of the DataFrame and assign zero to them. The selected rows are those with a date other than 2018-03-03, and the column is "nueva".
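One caveat: assigning the integer 0 into a column of time differences mixes two kinds of values, and depending on the pandas version this may turn the column into a generic object column or produce a warning. As a possible alternative (a sketch, not the approach used above), Series.where() with a zero-length pd.Timedelta keeps the column as timedelta. The names diff and objetivo are invented for this illustration, and the four dates are taken from rows 70 to 73 of the example data:

```python
import pandas as pd

df = pd.DataFrame({"fecha1": pd.to_datetime(
    ["2018-04-08", "2018-03-03", "2018-03-05", "2018-03-10"])})

# Difference between each date and the next one (NaT in the last row).
diff = df["fecha1"].shift(-1) - df["fecha1"]
objetivo = pd.Timestamp("2018-03-03")

# Keep the difference where fecha1 equals the target date; everywhere else
# use a zero-length Timedelta so the column keeps its timedelta dtype.
df["nueva"] = diff.where(df["fecha1"] == objetivo, pd.Timedelta(0))
print(df)
```

With this variant, if the target date were in the last row its value would stay NaT, since there is no next date to subtract from.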

This is the result (only first five rows):

      fecha1 nueva
0 2018-01-02     0
1 2018-01-06     0
2 2018-02-05     0
3 2018-03-10     0
4 2018-02-10     0

The result is not very interesting because none of those dates is the one we are looking for. But it appears later, in row 71:

       fecha1            nueva
70 2018-04-08                0
71 2018-03-03  2 days 00:00:00
72 2018-03-05                0
73 2018-03-10                0
74 2018-04-01                0
75 2018-02-01                0
    
answered 04.01.2019 at 15:12