The simplest thing I think it can be to create a DateTimeIndex by joining the columns fecha
and hora
into one. Once this is done, you can use pandas.DatetimeIndex
and pandas.DataFrame.reindex
or pandas.DataFrame.asfreq
to reindex the DataFrame generating all the missing rows.
Let's simulate a csv file using io.StringIO
:
import io
data = io.StringIO('''\
fecha,hora,dpre,vel,dirgr
01/01/2018,01:40,sur,3.9,100
01/01/2018,02:00,este,3.5,110
01/01/2018,04:00,oeste,2.9,112
01/01/2018,05:00,oeste,1.5,50
01/01/2018,23:00,norte,3.3,16
27/02/2018,21:00,sur,2.9,115
''')
Now we can load the csv, but we will join both columns in one and place it as an index (important to pass dayfirst=True
because by default it tries to parse the dates of the form mm / dd / yyyy):
import pandas as pd
df = pd.read_csv(data, dayfirst=True, index_col=0,
parse_dates = {'Fecha': ['fecha','hora']}
)
Which generates the following DataFrame:
dpre vel dirgr
Fecha
2018-01-01 01:40:00 sur 3.9 100
2018-01-01 02:00:00 este 3.5 110
2018-01-01 04:00:00 oeste 2.9 112
2018-01-01 05:00:00 oeste 1.5 50
2018-01-01 23:00:00 norte 3.3 16
2018-02-27 21:00:00 sur 2.9 115
Now we are going to create all the rows that are missing to be able to have 24 rows per day (one per hour) and this for all the months from the first one that appears until the last one.
import datetime
import calendar
ultima_fecha = max(df.index)
ultimo_dia_mes = calendar.monthrange(ultima_fecha.year, ultima_fecha.month)[1]
inicio = min(df.index).replace(day=1, hour=0, minute=0, second=0)
fin = ultima_fecha.replace(day=ultimo_dia_mes, hour=23, minute=0, second=0)
index = pd.DatetimeIndex(start=inicio, end=fin, freq="1h")
df = df.reindex(index)
inicio
is 00:00:00 hours on the first day of the first month that appears in the DataFrame. fin
is 23:00:00 hours on the last day of the last month that appears in the DataFrame.
This gives us as a result:
>>> df
dpre vel dirgr
2018-01-01 00:00:00 NaN NaN NaN
2018-01-01 01:00:00 NaN NaN NaN
2018-01-01 02:00:00 este 3.5 110.0
2018-01-01 03:00:00 NaN NaN NaN
2018-01-01 04:00:00 oeste 2.9 112.0
2018-01-01 05:00:00 oeste 1.5 50.0
2018-01-01 06:00:00 NaN NaN NaN
2018-01-01 07:00:00 NaN NaN NaN
2018-01-01 08:00:00 NaN NaN NaN
2018-01-01 09:00:00 NaN NaN NaN
2018-01-01 10:00:00 NaN NaN NaN
2018-01-01 11:00:00 NaN NaN NaN
2018-01-01 12:00:00 NaN NaN NaN
2018-01-01 13:00:00 NaN NaN NaN
2018-01-01 14:00:00 NaN NaN NaN
2018-01-01 15:00:00 NaN NaN NaN
2018-01-01 16:00:00 NaN NaN NaN
2018-01-01 17:00:00 NaN NaN NaN
2018-01-01 18:00:00 NaN NaN NaN
2018-01-01 19:00:00 NaN NaN NaN
2018-01-01 20:00:00 NaN NaN NaN
2018-01-01 21:00:00 NaN NaN NaN
2018-01-01 22:00:00 NaN NaN NaN
2018-01-01 23:00:00 norte 3.3 16.0
2018-01-02 00:00:00 NaN NaN NaN
2018-01-02 01:00:00 NaN NaN NaN
2018-01-02 02:00:00 NaN NaN NaN
2018-01-02 03:00:00 NaN NaN NaN
2018-01-02 04:00:00 NaN NaN NaN
2018-01-02 05:00:00 NaN NaN NaN
... ... ... ...
2018-02-27 18:00:00 NaN NaN NaN
2018-02-27 19:00:00 NaN NaN NaN
2018-02-27 20:00:00 NaN NaN NaN
2018-02-27 21:00:00 sur 2.9 115.0
2018-02-27 22:00:00 NaN NaN NaN
2018-02-27 23:00:00 NaN NaN NaN
2018-02-28 00:00:00 NaN NaN NaN
2018-02-28 01:00:00 NaN NaN NaN
2018-02-28 02:00:00 NaN NaN NaN
2018-02-28 03:00:00 NaN NaN NaN
2018-02-28 04:00:00 NaN NaN NaN
2018-02-28 05:00:00 NaN NaN NaN
2018-02-28 06:00:00 NaN NaN NaN
2018-02-28 07:00:00 NaN NaN NaN
2018-02-28 08:00:00 NaN NaN NaN
2018-02-28 09:00:00 NaN NaN NaN
2018-02-28 10:00:00 NaN NaN NaN
2018-02-28 11:00:00 NaN NaN NaN
2018-02-28 12:00:00 NaN NaN NaN
2018-02-28 13:00:00 NaN NaN NaN
2018-02-28 14:00:00 NaN NaN NaN
2018-02-28 15:00:00 NaN NaN NaN
2018-02-28 16:00:00 NaN NaN NaN
2018-02-28 17:00:00 NaN NaN NaN
2018-02-28 18:00:00 NaN NaN NaN
2018-02-28 19:00:00 NaN NaN NaN
2018-02-28 20:00:00 NaN NaN NaN
2018-02-28 21:00:00 NaN NaN NaN
2018-02-28 22:00:00 NaN NaN NaN
2018-02-28 23:00:00 NaN NaN NaN
[1416 rows x 3 columns]
If, for whatever reason, we do not want to have the date and time in the same column, we can always recreate both columns from the index and reset the latter:
df.insert(0, "hora", df.index.time)
df.insert(0, "fecha", df.index.date)
df.reset_index(inplace=True, drop=True)
With what we get:
>>> df
fecha hora dpre vel dirgr
0 2018-01-01 00:00:00 NaN NaN NaN
1 2018-01-01 01:00:00 NaN NaN NaN
2 2018-01-01 02:00:00 este 3.5 110.0
3 2018-01-01 03:00:00 NaN NaN NaN
4 2018-01-01 04:00:00 oeste 2.9 112.0
5 2018-01-01 05:00:00 oeste 1.5 50.0
6 2018-01-01 06:00:00 NaN NaN NaN
7 2018-01-01 07:00:00 NaN NaN NaN
8 2018-01-01 08:00:00 NaN NaN NaN
9 2018-01-01 09:00:00 NaN NaN NaN
10 2018-01-01 10:00:00 NaN NaN NaN
11 2018-01-01 11:00:00 NaN NaN NaN
12 2018-01-01 12:00:00 NaN NaN NaN
13 2018-01-01 13:00:00 NaN NaN NaN
14 2018-01-01 14:00:00 NaN NaN NaN
15 2018-01-01 15:00:00 NaN NaN NaN
16 2018-01-01 16:00:00 NaN NaN NaN
17 2018-01-01 17:00:00 NaN NaN NaN
18 2018-01-01 18:00:00 NaN NaN NaN
19 2018-01-01 19:00:00 NaN NaN NaN
20 2018-01-01 20:00:00 NaN NaN NaN
21 2018-01-01 21:00:00 NaN NaN NaN
22 2018-01-01 22:00:00 NaN NaN NaN
23 2018-01-01 23:00:00 norte 3.3 16.0
24 2018-01-02 00:00:00 NaN NaN NaN
25 2018-01-02 01:00:00 NaN NaN NaN
26 2018-01-02 02:00:00 NaN NaN NaN
27 2018-01-02 03:00:00 NaN NaN NaN
28 2018-01-02 04:00:00 NaN NaN NaN
29 2018-01-02 05:00:00 NaN NaN NaN
... ... ... ... ... ...
1386 2018-02-27 18:00:00 NaN NaN NaN
1387 2018-02-27 19:00:00 NaN NaN NaN
1388 2018-02-27 20:00:00 NaN NaN NaN
1389 2018-02-27 21:00:00 sur 2.9 115.0
1390 2018-02-27 22:00:00 NaN NaN NaN
1391 2018-02-27 23:00:00 NaN NaN NaN
1392 2018-02-28 00:00:00 NaN NaN NaN
1393 2018-02-28 01:00:00 NaN NaN NaN
1394 2018-02-28 02:00:00 NaN NaN NaN
1395 2018-02-28 03:00:00 NaN NaN NaN
1396 2018-02-28 04:00:00 NaN NaN NaN
1397 2018-02-28 05:00:00 NaN NaN NaN
1398 2018-02-28 06:00:00 NaN NaN NaN
1399 2018-02-28 07:00:00 NaN NaN NaN
1400 2018-02-28 08:00:00 NaN NaN NaN
1401 2018-02-28 09:00:00 NaN NaN NaN
1402 2018-02-28 10:00:00 NaN NaN NaN
1403 2018-02-28 11:00:00 NaN NaN NaN
1404 2018-02-28 12:00:00 NaN NaN NaN
1405 2018-02-28 13:00:00 NaN NaN NaN
1406 2018-02-28 14:00:00 NaN NaN NaN
1407 2018-02-28 15:00:00 NaN NaN NaN
1408 2018-02-28 16:00:00 NaN NaN NaN
1409 2018-02-28 17:00:00 NaN NaN NaN
1410 2018-02-28 18:00:00 NaN NaN NaN
1411 2018-02-28 19:00:00 NaN NaN NaN
1412 2018-02-28 20:00:00 NaN NaN NaN
1413 2018-02-28 21:00:00 NaN NaN NaN
1414 2018-02-28 22:00:00 NaN NaN NaN
1415 2018-02-28 23:00:00 NaN NaN NaN
[1416 rows x 5 columns]
Note: Keep in mind that 8760 or 8784 rows are generated per year. If our DataFrame covers multiple years, it must be taken into account in order to incur a lack of memory.
EDITING
Python and Pandas for the 24-hour format ( %H
) accept hours in the range 0..23. Since you have hours of the 24:00:00 form the parseado to datetime
fails.
If you have no way of correcting this in origin when creating the csv, the option is to create your own function that is responsible for formatting the dates, although keep in mind that this means a reduction in the efficiency of the csv load:
import io
import pandas as pd
import datetime
import calendar
data = io.StringIO('''\
fecha,hora,dpre,vel,dirgr
01/01/2018,02:40,sur,3.9,100
01/01/2018,02:00,este,3.5,110
01/01/2018,04:00,oeste,2.9,112
01/01/2018,05:00,oeste,1.5,50
01/01/2018,24:00:00,norte,3.3,16
27/02/2018,00:40,sur,2.9,115
''')
def date_parse(cad):
try:
return pd.to_datetime(cad, dayfirst=True)
except ValueError:
date = pd.to_datetime(cad.split(" ", 1)[0] + " 00:00", dayfirst=True)
date += datetime.timedelta(days=1)
return date
df = pd.read_csv(data, index_col=0,
date_parser = date_parse,
parse_dates = {'Fecha': ['fecha','hora']},
)
This allows to pass the date 01/01/2018 24:00:00
to 02/01/2018 00:00:00
accepted format without problems. With what we get the following DF:
dpre vel dirgr
Fecha
2018-01-01 02:40:00 sur 3.9 100
2018-01-01 02:00:00 este 3.5 110
2018-01-01 04:00:00 oeste 2.9 112
2018-01-01 05:00:00 oeste 1.5 50
2018-01-02 00:00:00 norte 3.3 16
2018-02-27 00:40:00 sur 2.9 115
Now we can continue from here with the previous code to create the missing rows.