Memory consumption when processing .csv file


First of all, thank you to anyone who takes the trouble to read this question, and a warning up front: I am a complete novice doing this as a hobby.

I am writing a program to run queries on very large .csv files that hold readings taken every 15 minutes, every day of the month, at different measurement points. The original file weighs around 650-700 MB.

To make the queries easier, the first thing I do is split the file with the monthly records into smaller files, one per day, using this code:

def separa_fechas(lista):
    if(True): #a real validation will go here at some point
        if(len(lista)==10): #if the record carries an identifier, drop it
            lista.pop(2)

        if(len(lista)==9): #now the list is in the expected initial format
            for i in range(len(lista)):
                aux=lista[i]
                aux2=aux.replace(chr(34), "") #in some data sets every field comes quoted, so strip the quotes
                lista[i]=aux2

            fecha_comillas=lista[1] #the date string keeps its quotes, so remove them (redundant with the loop above, but it does no harm)
            fecha=fecha_comillas.replace(chr(34), "") #replace " with an empty string; chr(34) is the code for the double-quote character
            lista_aux=fecha.split(" ")
            lista_dia=lista_aux[0].split("-") #this is the date [yyyy, mm, dd], careful: still as str
            lista_hora=lista_aux[1].split(":") #this is the time [hh, mm, ss], careful: still as str

            #Now build the list to return, converting the numeric fields to int since they come out of the file line as strings
            lista_return=[int(lista[0]), int(lista_dia[0]), int(lista_dia[1]), int(lista_dia[2]), int(lista_hora[0]), int(lista_hora[1]), lista[2], int(lista[3]), int(lista[4]), int(lista[5]), int(lista[6]), lista[7], int(lista[8])]

        return lista_return

    else:
        print("Algo ha ido mal con los datos :(")


import os

datos=open(ruta, "r")

contador=0 

for i in datos:
    if contador==0: #skip the first line of the file, the one with the column headers
        contador+=1
        continue

    else: 
        contador+=1
        lista=i.split(";") #turn each line of the file into a list

        lista_fecha=separa_fechas(lista) #build a new list with the date already split apart :)

        ruta_main=os.path.dirname(os.path.abspath(__file__))

        ruta_carpeta=genera_ruta_carpeta(lista_fecha[1],lista_fecha[2],ruta_main) #this returns "main script directory"\Datos trafico\month_name YYYY

        if not os.path.exists(ruta_carpeta): #check whether the folder exists and create it if it does not
            os.makedirs(ruta_carpeta)


        ruta_archivo=genera_ruta_archivo(lista_fecha[1], lista_fecha[2], lista_fecha[3], ruta_main) #"main script directory"\Datos trafico\Month_name YYYY\YYYY-MM-DD

        #Now that we have the file path, check whether the daily file already exists; os.path.exists(file) returns True if it does
        if os.path.exists(ruta_archivo):
            resumen_diario=open(ruta_archivo, "a")
            resumen_diario.write("\n")
            resumen_diario.write(str(lista_fecha))
            resumen_diario.close()

        else:
            resumen_diario=open(ruta_archivo, "w")
            resumen_diario.write(str(lista_fecha))
            resumen_diario.close()

datos.close()

The problem is that, while processing the files, Windows sometimes throws an error saying something like it has run out of memory (I did not save the screenshot and a run takes several hours, but I can reproduce it and post it if necessary).

I do not really understand where this runaway memory consumption comes from, since, at most, it has to load the large file into memory once; from then on it only opens and closes the small daily files.

I have been thinking about solutions and so far I can only come up with two:

  • Cut the large file into smaller ones and do the same process with those smaller files.

  • Keep a pointer into the large file and open, seek, and close it to read the successive lines (a rough sketch of this idea is just below).
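A minimal sketch of what I mean by the second idea (the file name registros.csv is made up here; in my real script the path is stored in ruta):

posicion=0 #position of the next unread line in the big file

while True:
    datos=open("registros.csv", "r") #reopen the big monthly file
    datos.seek(posicion) #jump back to where the previous iteration stopped
    linea=datos.readline() #read exactly one line
    posicion=datos.tell() #remember the new position for the next iteration
    datos.close()

    if not linea: #an empty string means we reached the end of the file
        break

    #here the line would be processed as before: split(";"), separa_fechas, write to the daily file, etc.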

Which of the two do you think would work better? Is there any other, more effective solution?

Thanks

        
    asked by Miguel Barreiro 06.06.2018 at 10:25

    2 answers


    There are several possible strategies, but one would need to know more about the nature of the data.

    From what you say in the comments, the file is opened and closed once for every record just to write a single entry into it. That kind of input/output operation is very slow compared to the time spent on computation. Quite possibly, the tests you ran on the i5 and the i7 differ more because of the hard disk than because of CPU speed.
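    To get a feel for the difference, here is a minimal micro-benchmark (the file name prueba_io.txt and the record count are made up for the example, they are not taken from your script) comparing one open/close per record against a single open:

    import os
    import time

    N = 20000
    linea = "x" * 80 + "\n"

    # One open/close per record, as in the original loop
    t0 = time.perf_counter()
    for _ in range(N):
        f = open("prueba_io.txt", "a")
        f.write(linea)
        f.close()
    t_abrir_cada_vez = time.perf_counter() - t0
    os.remove("prueba_io.txt")

    # A single open shared by all the records
    t0 = time.perf_counter()
    with open("prueba_io.txt", "a") as f:
        for _ in range(N):
            f.write(linea)
    t_abrir_una_vez = time.perf_counter() - t0
    os.remove("prueba_io.txt")

    print("open per record:", t_abrir_cada_vez, "s / single open:", t_abrir_una_vez, "s")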

    Try to reduce the number of disk accesses. The simplest approach is to stage the data in a dictionary and dump everything at once when you are done reading:

    dirs = set()  # set of the directories that need to be created
    data = {}  # dictionary used to stage the data
    
    for i in datos:
      ...
      ...
      ruta_carpeta = genera_ruta_carpeta(lista_fecha[1],lista_fecha[2],ruta_main)
      dirs.add(ruta_carpeta)
    
      ruta_archivo = genera_ruta_archivo(lista_fecha[1], lista_fecha[2],
                                         lista_fecha[3], ruta_main)
      data[ruta_archivo] = lista_fecha
    
    for ruta_carpeta in dirs:
      if not os.path.exists(ruta_carpeta):
        os.makedirs(ruta_carpeta)
    
    for (ruta_archivo, lista_fecha) in data.items():
      with open(ruta_archivo, "a") as resumen_diario:
        resumen_diario.write(str(lista_fecha))
    

    Note that there is no need to distinguish between append and write mode when opening the file: append mode creates the file if it does not exist, just like write mode does.
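    As a quick check of that behaviour (demo.txt is just a throwaway name for the example):

    import os

    if os.path.exists("demo.txt"):
        os.remove("demo.txt")

    # "a" creates the file when it does not exist, so a single mode covers both cases
    with open("demo.txt", "a") as f:
        f.write("primera linea\n")

    print(os.path.exists("demo.txt"))  # prints True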

    I hope you see an improvement. Another possibility would be to use multiprocessing, for which the data would have to be partitioned (for example, by acquisition date).
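    A minimal sketch of that multiprocessing idea (the worker name procesa_particion and the pool size are made up; it assumes the rows have already been grouped into one (path, rows) pair per acquisition date):

    from multiprocessing import Pool

    def procesa_particion(particion):
        # Hypothetical worker: receives the target path and the rows for one day
        # and writes that file on its own, independently of the other processes
        ruta_archivo, filas = particion
        with open(ruta_archivo, "a") as resumen_diario:
            for fila in filas:
                resumen_diario.write(str(fila) + "\n")

    if __name__ == "__main__":
        particiones = []  # would be filled beforehand with (ruta_archivo, filas) pairs
        with Pool(processes=4) as pool:
            pool.map(procesa_particion, particiones)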

    Edit: correction and improvement of the code

    As it stands, there is a problem: the data dictionary does not actually keep all the records. Every time the assignment data[ruta_archivo] = lista_fecha runs, the previous value stored for that key is overwritten by the new one.

    A common way to solve this is to use a defaultdict:

    from collections import defaultdict
    
    data = defaultdict(list)  # dictionary used to stage the data,
                              # with list as the default factory
    
    for i in datos:
        ...
        ...
        data[ruta_archivo].append(lista_fecha)
    

    Now the dictionary values are lists of records, so the loop that writes the files has to change a little:

    for (ruta_archivo, listas_fechas) in data.items():
      with open(ruta_archivo, "a") as resumen_diario:
        for lista_fecha in listas_fechas:
          resumen_diario.write(str(lista_fecha) + "\n")  # write() expects a string, so convert the list first
    

    Note that each file is now opened only once to write all of its values, instead of once per value.

        
    answered by 11.06.2018 at 21:34

    Problem solved.

    The culprit was that I was running the script from the Sublime REPL, which for some unknown reason sends memory consumption through the roof.

    Running the .py script directly, memory consumption was much lower and the process much faster.

    Thank you very much to everyone who answered.

    Cheers

        
    answered by 12.06.2018 at 22:39