Simplify a large file (1 GB)


I'm working with a geoid model, specifically EGM2008:

  • The file contains the geoid undulation (in meters) with respect to the WGS84 ellipsoid, on a grid with a spacing of 2.5′ (arc minutes)

  • It takes up 1.28 GB

  • Example of the file

| LAT  | LONG  | N (geoid undulation, m) |
|------|-------|-------------------------|
| 90   | 0.0   | 15                      |
| 90   | 0.4   | 15                      |
| ...  | ...   | ...                     |
| 90   | 359.8 | 15                      |
| ...  |       |                         |
| 89.8 | 0.0   | 18                      |
| ...  | ...   | ...                     |
| 89.8 | 359.8 | 15                      |

To simplify it, the idea is to produce a file with a coarser grid step, for example 1° (1 degree = 60 minutes). This way we would get one undulation value per degree, and the [longitude, latitude] coordinates would be whole numbers:

| LAT  | LONG  | N (geoid undulation, m) |
|------|-------|-------------------------|
| 90   | 0.0   | 15                      |
| 90   | 1     | 15                      |
| ...  | ...   | ...                     |
| 90   | 359   | 15                      |
| ...  |       |                         |
| 89   | 0.0   | 18                      |
| ...  | ...   | ...                     |
| 89   | 359   | 15                      |
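
To get a feel for the sizes involved, the grid arithmetic can be sketched as follows. This is only a rough estimate: it assumes one line per grid node, latitudes running from 90 down to -90 inclusive, and longitudes from 0 up to (but not including) 360, which matches the sample above but is not confirmed for the whole file. The variable names are hypothetical.

```python
# Assumed layout (not confirmed for the real file): one line per grid node,
# latitudes 90..-90 inclusive, longitudes 0 up to (not including) 360.
input_step_min = 2.5   # grid spacing of the source file, in arc minutes
output_step_min = 60   # desired grid spacing (1 degree), in arc minutes

keep_every = int(output_step_min / input_step_min)   # 24: keep 1 node out of 24
lines_per_lat_in = int(360 * 60 / input_step_min)    # 8640 lines per latitude row
lat_rows_in = int(180 * 60 / input_step_min) + 1     # 4321 latitude rows

lines_in = lines_per_lat_in * lat_rows_in
lines_out = (lines_per_lat_in // keep_every) * ((lat_rows_in - 1) // keep_every + 1)

print(keep_every, lines_per_lat_in, lines_in, lines_out)
```

Under these assumptions the output keeps 360 values for each of 181 latitudes, a tiny fraction of the input lines.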

I have the following code, which uses islice inside an infinite while loop so that:

  • I read the lines I'm interested in and write them to a file
  • I read the lines I'm not interested in, without saving them
  • I repeat the process until islice returns no more elements

It is this last condition that doesn't convince me, since it doesn't seem very "Pythonic", and I was wondering whether there is a better way to approach the problem.

NOTE

Calling readlines(), which loads the whole file into memory as a list, raises a MemoryError

from itertools import islice

filename           = 'EGM2008_2_5min_N.dat'
paso_malla_fichero = 2.5 # In minutes
paso_malla_salida  = 60

# Number of lines between consecutive whole-degree longitudes
salto              = int(paso_malla_salida / paso_malla_fichero)
# Number of lines between one latitude and the next
salto_lat          = int(360 * (60 / paso_malla_fichero))
# Lines to skip before the next block starts: a full block minus the lines already read (the ones to extract)
rest_of_lines      = ( salto * salto_lat ) - (salto_lat - 1)


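def parse(fileIn, fileOut):
    while True:
        # Every `salto`-th line of the current latitude row
        lines = islice(fileIn, 0, salto_lat - 1, salto)

        try:
            fileOut.write(next(lines))
        except StopIteration:
            # Nothing left to read: end of file
            break

        for line in lines:
            fileOut.write(line)

        # Skip the lines of the latitude rows that are not kept
        for line in islice(fileIn, rest_of_lines):
            pass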


with open(filename, 'rt') as file :
    with open('salida.txt', 'wt') as fileOut:
        parse(file, fileOut)
    
asked by Jose Hermosilla Rodrigo 06.10.2017 в 16:22

1 answer


It is not clear to me whether islice() has any advantage over plain sequential line-by-line reading, but regardless of that, I would perhaps simplify the code of parse as follows:

cant_lineas_a_leer = salto * salto_lat
lineas_a_salvar = set(range(0, salto_lat, salto))

def parse(fileIn, fileOut):
    while True:
        # Read one whole block of `salto` latitude rows
        lines = list(islice(fileIn, cant_lineas_a_leer))
        if not lines:
            break

        fileOut.write("".join(l for i, l in enumerate(lines) if i in lineas_a_salvar))

I define a first variable, cant_lineas_a_leer, that sets the number of lines we read per block; in your example that is 207360 (24 latitude rows of 8640 lines each). Of these, as far as I understood, you only process the first 8640, keeping 1 out of every 24, so we define a set lineas_a_salvar with exactly those indices. When saving, we then only have to concatenate the lines selected by [l for i, l in enumerate(lines) if i in lineas_a_salvar], which picks out just the lines we want.
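
For comparison, the plain line-by-line reading mentioned at the start could look roughly like the sketch below. It assumes the file is laid out row by row with a fixed number of lines per latitude; the function name downsample and its parameters are hypothetical, and the tiny in-memory grid only stands in for the real file.

```python
import io


def downsample(lines, cols_per_row, step):
    """Yield every `step`-th line of every `step`-th row of a row-major grid."""
    for i, line in enumerate(lines):
        # Position of this line in the grid, derived from its line number
        row, col = divmod(i, cols_per_row)
        if row % step == 0 and col % step == 0:
            yield line


# Tiny synthetic grid: 5 rows x 4 columns, one "row,col" value per line.
sample = io.StringIO("".join(f"{r},{c}\n" for r in range(5) for c in range(4)))
kept = list(downsample(sample, cols_per_row=4, step=2))
print(kept)
```

With the real file this would be called as fileOut.writelines(downsample(file, salto_lat, salto)). It never holds more than one line in memory and ends naturally when the file does, at the cost of one divmod per input line.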

    
answered by 06.10.2017 / 22:27