R, cannot allocate vector of size 1123.5 Gb


I have a consumption matrix with half a million observations and 187 variables. When I run

fviz_nbclust(data, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")' 

it gives me the following error:

"cannot allocate vector of size 1123.5 Gb"

I suppose it is a problem of RAM capacity. What alternatives do I have for working with this matrix so that memory is not an impediment?

Note: R version 3.5.1 (2018-07-02) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)

> memory.limit()
[1] 8019
> memory.size()
[1] 1147.37
    
asked by googolplex on 18.12.2018 at 21:03

1 answer


Simplifying a lot, the answer would be:

  • Install more physical memory; as the error indicates, you would need something over 1 TB.
  • Rent a cloud service that offers a similar amount of memory (Amazon, for example).

Going into a bit more detail, the problem is basically that the way fviz_nbclust() estimates the optimal number k involves first computing a distance matrix, which grows quadratically with the number of observations. You can calculate how many values this matrix holds:

n <- 500000
n*(n-1)/2
[1] 124999750000
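
Each of those values is stored as an 8-byte double when dist() builds the matrix, so a rough estimate of the memory required (just the arithmetic, not output from your machine) would be:

n <- 500000
n * (n - 1) / 2 * 8 / 2^30   # bytes converted to GiB: roughly 931 GiB for exactly 500,000 rows

With somewhat more than half a million observations this lands around the 1123.5 Gb that the error reports, hence the suggestion of more than 1 TB of memory.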

Reproducing your problem in a simpler way:

tam <- c(100000, 1) # to define a matrix of 100,000 rows and 1 column
m <- matrix(rnorm(tam[1]*tam[2]), ncol = tam[2])
d <- dist(m)

Error in dist(m) : cannot allocate vector of length 704982704

In my case, with a more modest machine, I can no longer compute the distances for a matrix of 100,000 rows and a single column. What can be done? Strictly speaking, the two things I mentioned above; however, another way to approach the problem is to work with a smaller dataset, that is, to extract a smaller, more manageable sample:

# take a random sample of 1,000 rows from the full data
small_data <- data[sample(nrow(data), 1000), , drop = FALSE]
fviz_nbclust(small_data, kmeans, method = "wss") +
    geom_vline(xintercept = 3, linetype = 2)
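
If 1,000 rows seems too small, you can invert the same arithmetic to pick roughly the largest sample whose distance matrix still fits in RAM. A sketch only; the names ram_gib and max_rows and the 50% safety margin are assumptions you should adjust to your machine:

ram_gib <- 8                                            # approximate RAM, as reported by memory.limit()
max_rows <- floor(sqrt(2 * ram_gib * 0.5 * 2^30 / 8))   # n such that n*(n-1)/2 doubles use ~half the RAM
max_rows                                                # around 32,768 rows in this case
small_data <- data[sample(nrow(data), max_rows), , drop = FALSE]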

Or, a more basic way would be to apply kmeans directly on the complete data with different values of k and, in each case, evaluate the total within-cluster sum of squares (the sum of squared distances from each observation to its centroid). We can also plot this, although the optimal k then has to be judged visually.

k.max <- 10                                   # maximum number of clusters to try
wss <- sapply(1:k.max,
              function(k) kmeans(data, k, nstart = 50, iter.max = 5)$tot.withinss)
wss
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
    
answered on 19.12.2018 at 19:01