How can I resume reading a CSV file from a specific line after the process was interrupted?


My problem is how to read a CSV file of 1,000,000 records and return to the line that was processed last. This is needed when the process stops partway through: I have to resume from the point where it left off.

My idea was to write the current CSV line number to a .txt file, compare it with the index of the for loop that processes the CSV, and resume from that point.

What logic should I use? The processed data must be uploaded to a DynamoDB database, and I must not repeat rows that have already been read and inserted.

Thank you very much!

    
asked by Hugo Lesta 02.06.2017 at 03:40

1 answer


To continue reading a file from the line where you left off, you need to reposition the cursor at that point.

You can do what you propose and save the line number, either in a text file, by serializing the variable with pickle, or even in the database itself.

Afterwards, you iterate over the file line by line until you reach the desired line, something like:

datos = 'archivo.csv'
# Load the 0-based index of the last line read
ultima = 991

with open(datos, 'r') as f:
    # Skip lines until the cursor sits just after the last line read
    for n, _ in enumerate(f):
        if n == ultima:
            break

    # Read as many lines as we want, incrementing the counter as we go
    for _ in range(100):
        linea = f.readline()
        if not linea:
            break  # end of file
        print(linea)
        ultima += 1

# Save `ultima` so the next run can resume from it
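
The line-counter approach above could be sketched end to end as follows, under a few assumptions: Python 3, a CSV without a header row, and hypothetical names (`load_checkpoint`, `save_checkpoint`, `process_batch`, `handle_row`) that do not come from the original answer. It counts CSV rows rather than physical lines, and it writes the counter after every row through a temporary file and `os.replace`, so a crash should not leave a half-written checkpoint.

```python
import csv
import itertools
import os


def load_checkpoint(path):
    """Return the number of rows already processed (0 if no checkpoint exists)."""
    try:
        with open(path) as fh:
            return int(fh.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_checkpoint(path, count):
    """Write the counter to a temporary file, then swap it into place."""
    tmp = path + '.tmp'
    with open(tmp, 'w') as fh:
        fh.write(str(count))
    os.replace(tmp, path)


def process_batch(csv_path, checkpoint_path, batch_size, handle_row):
    """Skip the rows already done, process up to batch_size more, return the new count."""
    done = load_checkpoint(checkpoint_path)
    with open(csv_path, newline='') as fh:
        rows = csv.reader(fh)
        # islice still reads the skipped rows, but discards them without processing
        for row in itertools.islice(rows, done, done + batch_size):
            handle_row(row)  # hypothetical callback, e.g. the database insert
            done += 1
            save_checkpoint(checkpoint_path, done)
    return done
```

If the process dies between `handle_row` and `save_checkpoint`, that one row will be handled again on the next run, so the database write should ideally be safe to repeat.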

Another option is to avoid iterating over the file again by placing the cursor directly where we left it. It is important to make sure that the file is never modified between reads: if a single byte is added or deleted we will get unexpected results (just as we would if lines were added in the previous example). For this we use the methods tell (to get the cursor position) and seek (to move the cursor where we want):

datos = 'archivo.csv'
# Load the last cursor position (a value previously returned by f.tell())
cursor = 1000

with open(datos, 'r') as f:
    # Move the cursor directly to the saved position
    f.seek(cursor)

    # Read as many lines as we want; readline() keeps tell() usable
    for _ in range(100):
        linea = f.readline()
        if not linea:
            break  # end of file
        print(linea)

    # Get the new position while the file is still open
    cursor = f.tell()

# Save `cursor` so the read can be resumed later
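
Two details of tell and seek are worth keeping in mind. In text mode, Python 3 treats the value returned by tell() as an opaque number that is only meant to be passed back to seek(), and calling tell() while iterating with `for line in f` raises an OSError. A minimal sketch that sidesteps both by opening the file in binary mode, where offsets are plain byte counts, might look like this. The helper name `read_lines_from` and the UTF-8 default are assumptions, not part of the original answer.

```python
def read_lines_from(path, offset, max_lines, encoding='utf-8'):
    """Read up to max_lines lines starting at a byte offset.

    Returns (lines, new_offset); new_offset is the value to store for the next run.
    """
    lines = []
    with open(path, 'rb') as fh:
        fh.seek(offset)
        for _ in range(max_lines):
            raw = fh.readline()
            if not raw:
                break  # end of file
            lines.append(raw.decode(encoding).rstrip('\r\n'))
        new_offset = fh.tell()  # taken while the file is still open
    return lines, new_offset
```

Because this splits on physical lines, it assumes that no CSV field contains an embedded newline.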

An implementation of this last idea, using pickle to serialize the state, could be:

import os
import pickle

class Reader:
    def __init__(self, ruta):
        self.ruta = ruta
        self.archivo = open(ruta)
        self.cursor = 0

    def get_line(self):       
        line = self.archivo.readline()
        self.cursor = self.archivo.tell()
        return line

    def restart(self):
        self.cursor = 0
        self.archivo.seek(0) 

    def __getstate__(self):
        new_dict = self.__dict__.copy()
        del new_dict['archivo']
        return new_dict

    def __setstate__(self, state):
        # Reopen the file and restore the saved cursor position
        archivo = open(state['ruta'])
        archivo.seek(state['cursor'])
        self.__dict__.update(state)
        self.archivo = archivo


class TextReader:
    def __init__(self, ruta):
        self.ruta = os.path.abspath(ruta)
        self.temp = os.path.splitext(self.ruta)[0]+ '.temp'

        try:
            with open(self.temp, 'rb') as dat:
                self.reader = pickle.load(dat)
        except (OSError, EOFError, pickle.UnpicklingError):
            # No usable saved state: start from the beginning of the file
            self.reader = Reader(self.ruta)

    def save(self):
        # Serialize the reader (path and cursor) and close the state file
        with open(self.temp, 'wb') as dat:
            pickle.dump(self.reader, dat)

    def get_lines(self, n):
        # Yield up to n lines, or fewer if the end of the file is reached
        for _ in range(n):
            line =  self.reader.get_line()
            if line:
                yield line
            else:
                break
        self.save()

    def readlines(self):
        # Yield every remaining line up to the end of the file
        while True:
            line = self.reader.get_line()
            if line:
                yield line
            else:
                break
        self.save()

    def restart(self):
        # Move the cursor back to the start of the document
        self.reader.restart()

Usage:

# Instantiate, passing the path of the file to read
f = TextReader('archivo.txt')

# Read as many lines as we want, then exit the application
for line in f.get_lines(100):
    print(line)

A file with the same name as our file but with the extension .temp should now have been created; it is nothing more than an instance of Reader serialized with pickle.

At any other time we can reread the file where we left off:

# Instantiate again, passing the path of the file to read
f = TextReader('archivo.txt')

# Read as many lines as we want
for line in f.get_lines(100):
    print(line)

In this case 100 lines are read again, but starting from where we left off the first time. We can use f.restart() to reread the document from the beginning (or simply delete the .temp file).

This is just a sketch of how to combine the cursor with pickle to resume reading a file; it should be optimized and adapted to your specific case. And remember: the file must not be modified under any circumstances between the moment reading stops and the moment it is resumed.
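
Since a modified file silently invalidates a saved position, one possible safeguard is to store a cheap fingerprint of the file next to the offset and refuse to resume if it no longer matches. The sketch below assumes Python 3 and uses JSON instead of pickle for the state file; the names `file_fingerprint`, `save_state` and `load_offset` are hypothetical, and comparing size and modification time is a heuristic that will not catch every possible change.

```python
import json
import os


def file_fingerprint(path):
    """Size and modification time: a cheap, not foolproof, change detector."""
    st = os.stat(path)
    return {'size': st.st_size, 'mtime_ns': st.st_mtime_ns}


def save_state(state_path, data_path, offset):
    """Store the offset together with the data file's fingerprint."""
    state = {'offset': offset, 'fingerprint': file_fingerprint(data_path)}
    tmp = state_path + '.tmp'
    with open(tmp, 'w') as fh:
        json.dump(state, fh)
    os.replace(tmp, state_path)  # swap the finished file into place


def load_offset(state_path, data_path):
    """Return the saved offset, or 0 if there is no usable state file.

    Raises RuntimeError if the data file changed since the state was saved.
    """
    try:
        with open(state_path) as fh:
            state = json.load(fh)
    except (FileNotFoundError, ValueError):
        return 0
    if state.get('fingerprint') != file_fingerprint(data_path):
        raise RuntimeError('data file changed since the last checkpoint')
    return state.get('offset', 0)
```

Raising instead of quietly restarting from zero leaves the decision to the caller, which matters when reprocessing would insert the same records into the database twice.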

    
answered by 02.06.2017 / 17:16