You can spawn a thread and load the dataframe in it. You will need to load it into a global variable so that it is accessible from the main program.
The following code can serve as a "basic skeleton". I have simulated the loading time with sleep(), because the dataframe I used (obtained from here) loads too fast.
import pandas as pd
import threading
import time

# url = "http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip"
url = "FL_insurance_sample.csv"
df = None

def cargar_dataframe(filename):
    global df
    print("Started loading dataframe {}".format(filename))
    df = pd.read_csv(filename)
    # Make it wait 2 extra seconds to simulate a slow load
    time.sleep(2)
    print("Finished loading the dataframe")

def main():
    print("Main program started")
    t = threading.Thread(target=cargar_dataframe, args=(url,))
    t.start()
    print("The main program takes 5 seconds to initialize")
    for i in range(5):
        print("{}s remaining in the main program".format(5 - i))
        time.sleep(1)
    print("Main program initialized. Waiting for the dataframe")
    t.join()
    print("Dataframe available")
    print(df.head())

if __name__ == "__main__":
    main()
The main program creates a thread that executes the function cargar_dataframe and starts it. Then it goes about its own business (in this case, a loop that prints a message every second), and finally waits (via t.join()) for the thread to finish. In this example the dataframe finishes loading before the main program reaches the join(), so by the time the main program gets there it does not have to wait (t.join() returns immediately). As proof that the dataframe has been read, I print its first rows.
This is what comes out when you run it:
Main program started
Started loading dataframe FL_insurance_sample.csv
The main program takes 5 seconds to initialize
5s remaining in the main program
4s remaining in the main program
3s remaining in the main program
Finished loading the dataframe
2s remaining in the main program
1s remaining in the main program
Main program initialized. Waiting for the dataframe
Dataframe available
policyID statecode county eq_site_limit ... point_longitude line construction point_granularity
0 119736 FL CLAY COUNTY 498960.0 ... -81.711777 Residential Masonry 1
1 448094 FL CLAY COUNTY 1322376.3 ... -81.707664 Residential Masonry 3
2 206893 FL CLAY COUNTY 190724.4 ... -81.700455 Residential Wood 1
3 333743 FL CLAY COUNTY 0.0 ... -81.707703 Residential Wood 3
4 172534 FL CLAY COUNTY 0.0 ... -81.702675 Residential Wood 1
[5 rows x 18 columns]
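By the way, if you would rather not use a global variable, the thread can hand the dataframe back to the main program through a queue.Queue. The following is a minimal sketch of my own (not part of the original code) showing that variant; it is the same idea with different plumbing:

import queue
import threading
import pandas as pd

def cargar_dataframe(filename, out):
    # Read the CSV and hand the result back through the queue
    out.put(pd.read_csv(filename))

def main():
    out = queue.Queue()
    t = threading.Thread(target=cargar_dataframe,
                         args=("FL_insurance_sample.csv", out))
    t.start()
    # ... the main program does its own initialization here ...
    df = out.get()  # blocks until the thread has put the dataframe
    t.join()
    print(df.head())

if __name__ == "__main__":
    main()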
Notes
The use of threads in Python does not usually improve execution times when the problem is CPU-bound, because due to the existence of a global lock (the GIL), the threads take turns executing instructions and cannot execute two instructions at once, even if you have several CPUs (cores).
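As a quick illustration of this point (a sketch of my own, not part of the original answer), a pure-Python CPU-bound task takes roughly the same wall-clock time whether it runs in one thread or is split across two, because the GIL serializes the bytecode execution:

import threading
import time

def count(n):
    # Pure-Python CPU-bound work: the GIL prevents two threads
    # from executing this bytecode simultaneously
    while n > 0:
        n -= 1

N = 10**7

start = time.time()
count(N)
print("One thread:  {:.2f}s".format(time.time() - start))

start = time.time()
threads = [threading.Thread(target=count, args=(N // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Two threads: {:.2f}s (about the same, due to the GIL)".format(time.time() - start))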
However, if the problem is input/output-bound, one thread can be waiting for data from the disk while another uses the CPU; there you would notice a speed improvement. The same happens if what one of the threads executes is a Python extension written in C, because in that case the extension can release the GIL and run in parallel with another thread executing pure Python. This is also applicable here, since many parts of pandas are written in C, and the part that reads CSV is probably one of them.
Therefore we can expect an improvement with this strategy.
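For instance (again a sketch of my own, assuming you had two CSV files, here hypothetically named part1.csv and part2.csv), loading several files from parallel threads can beat loading them one after another, to the extent that read_csv spends its time waiting on I/O or in C code that releases the GIL:

import threading
import time
import pandas as pd

# Hypothetical file names; substitute your own CSVs
files = ["part1.csv", "part2.csv"]
results = {}

def load(filename):
    results[filename] = pd.read_csv(filename)

# Sequential loading
start = time.time()
for f in files:
    load(f)
print("Sequential: {:.2f}s".format(time.time() - start))

results.clear()

# Threaded loading: the threads can overlap while waiting on I/O
start = time.time()
threads = [threading.Thread(target=load, args=(f,)) for f in files]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Threaded:   {:.2f}s".format(time.time() - start))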
I have read many times on the internet that if threads do not give you speed improvements, you should use processes instead (the multiprocessing module). In theory the advice is good: processes do not share memory, so they are not affected by the GIL and can actually use several cores. However, I do not think it is appropriate for this case, precisely because of that lack of shared memory: you could not have pandas read the CSV in one process and leave the result in a variable of another process. The multiprocessing module handles communication between processes via pipes or sockets, with intermediate conversions to the pickle format. That would be tremendously inefficient in this case because, as you state, the dataframe to be transferred is very large.
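To make that concrete, here is a sketch (my own, not from the original answer) of what the multiprocessing version would look like. It works, but the dataframe built in the worker process has to be pickled and sent back to the parent, which is exactly the expensive transfer warned about above:

import multiprocessing
import pandas as pd

def cargar_dataframe(filename):
    # Runs in a separate process with its own memory space
    return pd.read_csv(filename)

if __name__ == "__main__":
    with multiprocessing.Pool(1) as pool:
        # The resulting dataframe is pickled in the worker process and
        # unpickled here, which is costly for a large dataframe
        result = pool.apply_async(cargar_dataframe, ("FL_insurance_sample.csv",))
        df = result.get()
    print(df.head())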