Python MemoryError, maximum limit?

0

I need to create an NxN matrix where N=55000, but I don't know whether it exceeds the memory limit supported by Python. When I run this line:

metric_space = np.zeros((N,N))

simply returns me:

MemoryError

Is there any way to create a matrix with these dimensions?

    
asked by Juanca M 15.05.2017 at 12:08

1 answer

4

The problem is not caused by NumPy or Python itself: you are creating an array that needs more memory (RAM + virtual memory) than your computer has.

numpy.zeros() creates an array of float64 by default. Taking this into account:

  • 55000 * 55000 = 3025000000 floats.
  • 3025000000 floats * 64 bits/float = 193600000000 bits.
  • 193600000000 bits / (8 bits/byte) = 24200000000 bytes.
  • 24200000000 bytes / (10**9 bytes/GB) ≈ 24.2 GB.
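The arithmetic above can be checked directly from NumPy's dtype metadata:

```python
import numpy as np

# Reproduce the calculation above: bytes needed for an N x N float64 array.
N = 55000
n_bytes = N * N * np.dtype('float64').itemsize  # 8 bytes per float64
print(n_bytes)          # 24200000000 bytes
print(n_bytes / 10**9)  # 24.2 GB
```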

Given this, assuming it is necessary to address your problem using matrices of this size, we have several options:

  • Use numpy.memmap to create the matrix in the hard drive and work on it. The basic thing would be:

    import numpy as np
    
    N = 55000
    filename = 'metric_space.dat'
    metric_space = np.memmap(filename, dtype='float64', mode='w+', shape=(N,N))
    

    This creates a persistent file on disk that we can work on without exhausting the RAM. We can reuse the matrix at another time and keep working with it by simply changing the opening mode to 'r+' so as not to overwrite it.
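    As a minimal sketch of that workflow (using a small N purely for illustration, so the file stays tiny), we can fill the matrix in row blocks so that only one block is in RAM at a time, then reopen the same file with 'r+':

```python
import numpy as np

N = 1000  # small size purely for illustration
filename = 'metric_space.dat'

# Create the file-backed matrix and fill it in row blocks, so only
# one 100-row block lives in RAM at any given moment.
m = np.memmap(filename, dtype='float64', mode='w+', shape=(N, N))
for start in range(0, N, 100):
    m[start:start + 100, :] = 1.0
m.flush()  # make sure the data reaches the disk
del m      # close the memmap

# Reopen it later with 'r+' so the existing contents are preserved.
m2 = np.memmap(filename, dtype='float64', mode='r+', shape=(N, N))
print(m2[0, 0])  # 1.0
```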

  • Another option, if you are going to work with sparse matrices (where most elements have a value of 0; see Sparse matrix), is to use scipy.sparse.csc_matrix :

    import numpy as np
    import scipy.sparse
    
    N = 55000
    
    # Do not call .todense() here: that would allocate the full dense
    # matrix (24.2 GB) and reproduce the MemoryError.
    metric_space = scipy.sparse.csc_matrix((N, N), dtype=np.float64)
    

    In this case, the matrix is not saved to disk but stored more efficiently in memory. This does not guarantee that you will never exceed the available memory, since the storage efficiency depends on how many elements of the matrix are 0 at any given moment, so it must be used with care. You can create this matrix without problems, but if its elements stop being 0 (for example, after doing metric_space + 1.2) you will get another memory error.
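    As an illustration of why sparse storage helps, an empty CSC matrix stores no data elements at all, and memory grows only with the number of nonzero entries (for incremental assignment, lil_matrix is the more convenient format):

```python
import numpy as np
from scipy import sparse

N = 55000

# An empty CSC matrix of this shape is cheap: no data values are stored.
metric_space = sparse.csc_matrix((N, N), dtype=np.float64)

# For setting individual entries, lil_matrix is more efficient; memory
# stays proportional to the number of nonzero elements.
m = sparse.lil_matrix((N, N), dtype=np.float64)
m[0, 1] = 3.5
m[100, 200] = 7.0
print(m.nnz)  # 2 stored elements
```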

  • Use NumPy together with PyTables .

  • Depending on your problem and specific needs, you could take many other paths, such as making your algorithm work with submatrices instead of a single matrix of this size, using SFrame , etc.
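To illustrate the submatrix idea, here is a hypothetical sketch (the helper name and the inner computation, an outer product, are just placeholders for your own metric) that only ever holds one block x block submatrix of intermediate results in RAM, writing each into a preallocated output (which could itself be a memmap):

```python
import numpy as np

def process_in_blocks(data, block, out):
    """Fill a pairwise matrix block by block (hypothetical example;
    replace the outer product with your own metric computation)."""
    n = len(data)
    for i in range(0, n, block):
        for j in range(0, n, block):
            # Only one block x block submatrix is computed at a time.
            out[i:i + block, j:j + block] = np.outer(
                data[i:i + block], data[j:j + block])

# Tiny demonstration: the blockwise result matches the full computation.
data = np.arange(4, dtype='float64')
out = np.zeros((4, 4))
process_in_blocks(data, 2, out)
print(out)
```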

Keep in mind that, no matter how efficient the algorithm, working with data stored on disk will always be slower than working with data loaded in physical memory (reads/writes on the HDD/SSD will always be a bottleneck). Depending on your actual case, you can adjust your code to work in memory as much as possible (using submatrices, for example) and keep disk reads/writes to a minimum, but that depends on what you are trying to do and on your efficiency requirements.

    
answered 15.05.2017 at 14:42