Fill rows with missing dates in DataFrame

2

I am working with some data from a meteorological station in which the data is distributed by columns in day with 24 hours, predominant direction, wind direction in degrees

The problem is that there are days, even full months, that do not come and what I need is to find those missing dates, create them for each column and fill in Nan . I've been trying with Pandas but I can not get it, the data goes something like this:

fecha hora dpre vel dirgr
01/01/2018 01:00 sur 3.9 156 

This is a sample of the data.

    
asked by Jose.P 16.04.2018 в 21:25
source

1 answer

1

The simplest thing I think it can be to create a DateTimeIndex by joining the columns fecha and hora into one. Once this is done, you can use pandas.DatetimeIndex and pandas.DataFrame.reindex or pandas.DataFrame.asfreq to reindex the DataFrame generating all the missing rows.

Let's simulate a csv file using io.StringIO :

import io

data = io.StringIO('''\
fecha,hora,dpre,vel,dirgr
01/01/2018,01:40,sur,3.9,100
01/01/2018,02:00,este,3.5,110
01/01/2018,04:00,oeste,2.9,112
01/01/2018,05:00,oeste,1.5,50
01/01/2018,23:00,norte,3.3,16
27/02/2018,21:00,sur,2.9,115
''')

Now we can load the csv, but we will join both columns in one and place it as an index (important to pass dayfirst=True because by default it tries to parse the dates of the form mm / dd / yyyy):

import pandas as pd

df = pd.read_csv(data, dayfirst=True, index_col=0,
                 parse_dates = {'Fecha': ['fecha','hora']}
                 )

Which generates the following DataFrame:

                      dpre  vel  dirgr
Fecha                                 
2018-01-01 01:40:00    sur  3.9    100
2018-01-01 02:00:00   este  3.5    110
2018-01-01 04:00:00  oeste  2.9    112
2018-01-01 05:00:00  oeste  1.5     50
2018-01-01 23:00:00  norte  3.3     16
2018-02-27 21:00:00    sur  2.9    115

Now we are going to create all the rows that are missing to be able to have 24 rows per day (one per hour) and this for all the months from the first one that appears until the last one.

import datetime
import calendar

ultima_fecha = max(df.index)
ultimo_dia_mes = calendar.monthrange(ultima_fecha.year, ultima_fecha.month)[1]
inicio = min(df.index).replace(day=1, hour=0, minute=0, second=0)
fin = ultima_fecha.replace(day=ultimo_dia_mes, hour=23, minute=0, second=0)

index = pd.DatetimeIndex(start=inicio, end=fin, freq="1h")
df = df.reindex(index)
  

inicio is 00:00:00 hours on the first day of the first month that appears in the DataFrame. fin is 23:00:00 hours on the last day of the last month that appears in the DataFrame.

This gives us as a result:

>>> df
                      dpre  vel  dirgr
2018-01-01 00:00:00    NaN  NaN    NaN
2018-01-01 01:00:00    NaN  NaN    NaN
2018-01-01 02:00:00   este  3.5  110.0
2018-01-01 03:00:00    NaN  NaN    NaN
2018-01-01 04:00:00  oeste  2.9  112.0
2018-01-01 05:00:00  oeste  1.5   50.0
2018-01-01 06:00:00    NaN  NaN    NaN
2018-01-01 07:00:00    NaN  NaN    NaN
2018-01-01 08:00:00    NaN  NaN    NaN
2018-01-01 09:00:00    NaN  NaN    NaN
2018-01-01 10:00:00    NaN  NaN    NaN
2018-01-01 11:00:00    NaN  NaN    NaN
2018-01-01 12:00:00    NaN  NaN    NaN
2018-01-01 13:00:00    NaN  NaN    NaN
2018-01-01 14:00:00    NaN  NaN    NaN
2018-01-01 15:00:00    NaN  NaN    NaN
2018-01-01 16:00:00    NaN  NaN    NaN
2018-01-01 17:00:00    NaN  NaN    NaN
2018-01-01 18:00:00    NaN  NaN    NaN
2018-01-01 19:00:00    NaN  NaN    NaN
2018-01-01 20:00:00    NaN  NaN    NaN
2018-01-01 21:00:00    NaN  NaN    NaN
2018-01-01 22:00:00    NaN  NaN    NaN
2018-01-01 23:00:00  norte  3.3   16.0
2018-01-02 00:00:00    NaN  NaN    NaN
2018-01-02 01:00:00    NaN  NaN    NaN
2018-01-02 02:00:00    NaN  NaN    NaN
2018-01-02 03:00:00    NaN  NaN    NaN
2018-01-02 04:00:00    NaN  NaN    NaN
2018-01-02 05:00:00    NaN  NaN    NaN
...                    ...  ...    ...
2018-02-27 18:00:00    NaN  NaN    NaN
2018-02-27 19:00:00    NaN  NaN    NaN
2018-02-27 20:00:00    NaN  NaN    NaN
2018-02-27 21:00:00    sur  2.9  115.0
2018-02-27 22:00:00    NaN  NaN    NaN
2018-02-27 23:00:00    NaN  NaN    NaN
2018-02-28 00:00:00    NaN  NaN    NaN
2018-02-28 01:00:00    NaN  NaN    NaN
2018-02-28 02:00:00    NaN  NaN    NaN
2018-02-28 03:00:00    NaN  NaN    NaN
2018-02-28 04:00:00    NaN  NaN    NaN
2018-02-28 05:00:00    NaN  NaN    NaN
2018-02-28 06:00:00    NaN  NaN    NaN
2018-02-28 07:00:00    NaN  NaN    NaN
2018-02-28 08:00:00    NaN  NaN    NaN
2018-02-28 09:00:00    NaN  NaN    NaN
2018-02-28 10:00:00    NaN  NaN    NaN
2018-02-28 11:00:00    NaN  NaN    NaN
2018-02-28 12:00:00    NaN  NaN    NaN
2018-02-28 13:00:00    NaN  NaN    NaN
2018-02-28 14:00:00    NaN  NaN    NaN
2018-02-28 15:00:00    NaN  NaN    NaN
2018-02-28 16:00:00    NaN  NaN    NaN
2018-02-28 17:00:00    NaN  NaN    NaN
2018-02-28 18:00:00    NaN  NaN    NaN
2018-02-28 19:00:00    NaN  NaN    NaN
2018-02-28 20:00:00    NaN  NaN    NaN
2018-02-28 21:00:00    NaN  NaN    NaN
2018-02-28 22:00:00    NaN  NaN    NaN
2018-02-28 23:00:00    NaN  NaN    NaN

[1416 rows x 3 columns]

If, for whatever reason, we do not want to have the date and time in the same column, we can always recreate both columns from the index and reset the latter:

df.insert(0, "hora", df.index.time)
df.insert(0, "fecha", df.index.date)
df.reset_index(inplace=True, drop=True)

With what we get:

>>> df
           fecha      hora   dpre  vel  dirgr
0     2018-01-01  00:00:00    NaN  NaN    NaN
1     2018-01-01  01:00:00    NaN  NaN    NaN
2     2018-01-01  02:00:00   este  3.5  110.0
3     2018-01-01  03:00:00    NaN  NaN    NaN
4     2018-01-01  04:00:00  oeste  2.9  112.0
5     2018-01-01  05:00:00  oeste  1.5   50.0
6     2018-01-01  06:00:00    NaN  NaN    NaN
7     2018-01-01  07:00:00    NaN  NaN    NaN
8     2018-01-01  08:00:00    NaN  NaN    NaN
9     2018-01-01  09:00:00    NaN  NaN    NaN
10    2018-01-01  10:00:00    NaN  NaN    NaN
11    2018-01-01  11:00:00    NaN  NaN    NaN
12    2018-01-01  12:00:00    NaN  NaN    NaN
13    2018-01-01  13:00:00    NaN  NaN    NaN
14    2018-01-01  14:00:00    NaN  NaN    NaN
15    2018-01-01  15:00:00    NaN  NaN    NaN
16    2018-01-01  16:00:00    NaN  NaN    NaN
17    2018-01-01  17:00:00    NaN  NaN    NaN
18    2018-01-01  18:00:00    NaN  NaN    NaN
19    2018-01-01  19:00:00    NaN  NaN    NaN
20    2018-01-01  20:00:00    NaN  NaN    NaN
21    2018-01-01  21:00:00    NaN  NaN    NaN
22    2018-01-01  22:00:00    NaN  NaN    NaN
23    2018-01-01  23:00:00  norte  3.3   16.0
24    2018-01-02  00:00:00    NaN  NaN    NaN
25    2018-01-02  01:00:00    NaN  NaN    NaN
26    2018-01-02  02:00:00    NaN  NaN    NaN
27    2018-01-02  03:00:00    NaN  NaN    NaN
28    2018-01-02  04:00:00    NaN  NaN    NaN
29    2018-01-02  05:00:00    NaN  NaN    NaN
...          ...       ...    ...  ...    ...
1386  2018-02-27  18:00:00    NaN  NaN    NaN
1387  2018-02-27  19:00:00    NaN  NaN    NaN
1388  2018-02-27  20:00:00    NaN  NaN    NaN
1389  2018-02-27  21:00:00    sur  2.9  115.0
1390  2018-02-27  22:00:00    NaN  NaN    NaN
1391  2018-02-27  23:00:00    NaN  NaN    NaN
1392  2018-02-28  00:00:00    NaN  NaN    NaN
1393  2018-02-28  01:00:00    NaN  NaN    NaN
1394  2018-02-28  02:00:00    NaN  NaN    NaN
1395  2018-02-28  03:00:00    NaN  NaN    NaN
1396  2018-02-28  04:00:00    NaN  NaN    NaN
1397  2018-02-28  05:00:00    NaN  NaN    NaN
1398  2018-02-28  06:00:00    NaN  NaN    NaN
1399  2018-02-28  07:00:00    NaN  NaN    NaN
1400  2018-02-28  08:00:00    NaN  NaN    NaN
1401  2018-02-28  09:00:00    NaN  NaN    NaN
1402  2018-02-28  10:00:00    NaN  NaN    NaN
1403  2018-02-28  11:00:00    NaN  NaN    NaN
1404  2018-02-28  12:00:00    NaN  NaN    NaN
1405  2018-02-28  13:00:00    NaN  NaN    NaN
1406  2018-02-28  14:00:00    NaN  NaN    NaN
1407  2018-02-28  15:00:00    NaN  NaN    NaN
1408  2018-02-28  16:00:00    NaN  NaN    NaN
1409  2018-02-28  17:00:00    NaN  NaN    NaN
1410  2018-02-28  18:00:00    NaN  NaN    NaN
1411  2018-02-28  19:00:00    NaN  NaN    NaN
1412  2018-02-28  20:00:00    NaN  NaN    NaN
1413  2018-02-28  21:00:00    NaN  NaN    NaN
1414  2018-02-28  22:00:00    NaN  NaN    NaN
1415  2018-02-28  23:00:00    NaN  NaN    NaN

[1416 rows x 5 columns]
  

Note: Keep in mind that 8760 or 8784 rows are generated per year. If our DataFrame covers multiple years, it must be taken into account in order to incur a lack of memory.

EDITING

Python and Pandas for the 24-hour format ( %H ) accept hours in the range 0..23. Since you have hours of the 24:00:00 form the parseado to datetime fails.

If you have no way of correcting this in origin when creating the csv, the option is to create your own function that is responsible for formatting the dates, although keep in mind that this means a reduction in the efficiency of the csv load:

import io
import pandas as pd
import datetime
import calendar


data = io.StringIO('''\
fecha,hora,dpre,vel,dirgr
01/01/2018,02:40,sur,3.9,100
01/01/2018,02:00,este,3.5,110
01/01/2018,04:00,oeste,2.9,112
01/01/2018,05:00,oeste,1.5,50
01/01/2018,24:00:00,norte,3.3,16
27/02/2018,00:40,sur,2.9,115
''')

def date_parse(cad):
    try:
        return pd.to_datetime(cad, dayfirst=True)
    except ValueError:
        date = pd.to_datetime(cad.split(" ", 1)[0] + " 00:00", dayfirst=True)
        date += datetime.timedelta(days=1)
        return date



df = pd.read_csv(data, index_col=0,
                 date_parser = date_parse,
                 parse_dates = {'Fecha': ['fecha','hora']},
                 )

This allows to pass the date 01/01/2018 24:00:00 to 02/01/2018 00:00:00 accepted format without problems. With what we get the following DF:

                      dpre  vel  dirgr
Fecha                                 
2018-01-01 02:40:00    sur  3.9    100
2018-01-01 02:00:00   este  3.5    110
2018-01-01 04:00:00  oeste  2.9    112
2018-01-01 05:00:00  oeste  1.5     50
2018-01-02 00:00:00  norte  3.3     16
2018-02-27 00:40:00    sur  2.9    115

Now we can continue from here with the previous code to create the missing rows.

    
answered by 17.04.2018 / 00:28
source