Load Dataframe in the background

I have a small tool made in Python 3.6 in which I use pandas to load dataframes. I want to load a very large .xlsx file, with about 200,000 records, and I use file = pd.read_excel('archivo.xlsx')

This takes a long time and leaves the tool frozen for 20 or 30 seconds. Is there any way to launch that process in the background when the tool starts, so that the dataframe is already loaded by the time it is needed?

asked by Alfredo Lopez Rodes 10.12.2018 at 14:37

1 answer

You can launch a thread and load the dataframe in it. You will need to load it into a global variable so that it is accessible from the main program.

The following code can serve as a basic skeleton. I have simulated the loading time with sleep(), because the dataframe I used (obtained from here) loads too fast on its own.

import pandas as pd
import threading
import time

# url = "http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip"
url = "FL_insurance_sample.csv"

df = None

def cargar_dataframe(filename):
    global df
    print("Inicio la carga del dataframe {}".format(filename))
    df = pd.read_csv(filename)
    # Make it wait 2 more seconds
    time.sleep(2)
    print("Terminé la carga del dataframe")


def main():
    print("Arrancado programa principal")
    t = threading.Thread(target=cargar_dataframe, args=(url,))
    t.start()
    print("El programa principal tarda 5 segundos en iniciarse")
    for i in range(5):
        print("Quedan {}s en el programa principal".format(5-i))
        time.sleep(1)
    print("Programa principal inicializado. Esperando por dataframe")
    t.join()
    print("Dataframe disponible")
    print(df.head())

if __name__ == "__main__":
    main()

The main program creates a thread to run the function cargar_dataframe and starts it. It then goes about its own business (in this case, a loop that prints a message every second) and finally waits, via t.join(), for the thread to finish. In this example the dataframe finishes loading before the main program reaches join(), so by the time it gets there it has nothing to wait for (t.join() returns immediately). As proof that the dataframe has been read, I print its first rows.

This is what comes out when you run it:

Arrancado programa principal
Inicio la carga del dataframe FL_insurance_sample.csv
El programa principal tarda 5 segundos en iniciarse
Quedan 5s en el programa principal
Quedan 4s en el programa principal
Quedan 3s en el programa principal
Terminé la carga del dataframe
Quedan 2s en el programa principal
Quedan 1s en el programa principal
Programa principal inicializado. Esperando por dataframe
Dataframe disponible
   policyID statecode       county  eq_site_limit        ...          point_longitude         line  construction  point_granularity
0    119736        FL  CLAY COUNTY       498960.0        ...               -81.711777  Residential       Masonry                  1
1    448094        FL  CLAY COUNTY      1322376.3        ...               -81.707664  Residential       Masonry                  3
2    206893        FL  CLAY COUNTY       190724.4        ...               -81.700455  Residential          Wood                  1
3    333743        FL  CLAY COUNTY            0.0        ...               -81.707703  Residential          Wood                  3
4    172534        FL  CLAY COUNTY            0.0        ...               -81.702675  Residential          Wood                  1

[5 rows x 18 columns]
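As a variant, the same idea can be written without a global variable by using concurrent.futures from the standard library: submit() returns a Future immediately, and future.result() blocks only if the load has not finished yet. This is a minimal sketch; the tiny CSV written at the top is just a stand-in for your real file.

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real file, just so the sketch is runnable.
with open("sample.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

executor = ThreadPoolExecutor(max_workers=1)
# submit() returns immediately; pd.read_csv runs in a worker thread.
future = executor.submit(pd.read_csv, "sample.csv")

# ... the main program can initialize itself here while the load runs ...

df = future.result()  # blocks only if the load is not yet finished
print(len(df))  # 2
```

For read_excel the call would be future = executor.submit(pd.read_excel, 'archivo.xlsx'), with everything else unchanged.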

Notes

The use of threads in Python does not usually improve execution times when the problem is CPU-bound: because of the global interpreter lock (the GIL), the threads take turns executing bytecode and cannot run two instructions at once, even if several CPUs (cores) are available.

However, if the problem is I/O-bound, one thread can be waiting for data from the disk while another uses the CPU, and there you would notice a speed improvement. The same applies when one of the threads runs a Python extension written in C, because such extensions can release the GIL and run in parallel with other threads executing pure Python. That is applicable here too: many parts of pandas are written in C, and the CSV reader is probably one of them.
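The overlap of I/O waits can be seen with a toy experiment, using time.sleep() as a stand-in for blocking I/O (just as the skeleton above does): two one-second waits running in separate threads finish in about one second total, not two.

```python
import threading
import time

def wait_io():
    time.sleep(1)  # stands in for blocking disk or network I/O

start = time.monotonic()
threads = [threading.Thread(target=wait_io) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
print(elapsed < 1.5)  # True: the two waits overlap instead of adding up
```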

Therefore we can expect an improvement with this strategy.

I have often read online the advice that if threads do not give you a speedup, you should use processes instead (the multiprocessing module). In theory the advice is good, since processes do not share memory, so the GIL does not interfere and several cores can really be used. However, I do not think it is appropriate for this case, precisely because processes do not share memory: you could not have pandas read the CSV in one process and leave it in a variable of another. The multiprocessing module communicates processes through pipes or sockets, with intermediate conversion to the pickle format, and that would be tremendously inefficient here because, as you say, the dataframe to be transferred is very large.
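The serialization cost mentioned above can be made concrete: to hand the dataframe back to the parent, multiprocessing has to pickle it in the worker and unpickle it in the parent, roughly like this (a toy dataframe stands in for the real 200,000-row one).

```python
import pickle

import pandas as pd

# Toy stand-in; the real dataframe would have ~200,000 rows.
df = pd.DataFrame({"x": range(10_000), "y": range(10_000)})

# This is essentially what multiprocessing does to move a result
# from the worker process back to the parent:
payload = pickle.dumps(df)
restored = pickle.loads(payload)

print(restored.equals(df))  # True: a full second copy was rebuilt
```

For a large frame, the whole thing is copied byte by byte through the pipe and rebuilt on the other side, which defeats the purpose of loading it in the background.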

    
answered on 10.12.2018 at 18:24