# R, cannot allocate vector of size 1123.5 Gb


I have a consumption matrix with half a million observations and 187 variables. When I run

```
fviz_nbclust(data, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")
```

it gives me the following error:

"cannot allocate vector of size 1123.5 Gb"

I suppose it is a problem of RAM capacity. What alternatives do I have for working with this matrix so that memory is not an impediment?

Note: R version 3.5.1 (2018-07-02), Platform: x86_64-w64-mingw32/x64 (64-bit), Running under: Windows >= 8 x64 (build 9200)

```
> memory.limit()
 8019
> memory.size()
 1147.37
```

asked by googolplex on 18.12.2018 at 21:03


Simplifying a lot, the answer would be:

• Install more physical memory; as the error itself suggests, you would need something over 1 TB
• Hire a cloud service that offers a similar amount of memory (Amazon, for example)

Now, in a little more detail: the problem basically lies in the way `fviz_nbclust()` estimates the optimal number of clusters `k`. It first calculates a distance matrix, which grows quadratically with the number of observations. You can calculate the number of values in that matrix yourself:

```
n <- 500000       # number of observations
n * (n - 1) / 2   # number of pairwise distances
 124999750000
```
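Each of those distances is stored as an 8-byte double, so you can turn that count into a rough memory estimate (a back-of-the-envelope sketch; the exact figure depends on how many rows you really have):

```
n <- 500000
bytes <- n * (n - 1) / 2 * 8   # 8 bytes per double
bytes / 1024^3                 # roughly 931 GiB for exactly 500,000 rows
```

With slightly more than half a million rows this climbs past 1 TB, which is consistent with the 1123.5 Gb reported in your error.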

Reproducing your problem on a smaller scale:

```
tam <- c(100000, 1)  # to define a matrix of 100,000 rows and 1 column
m <- matrix(rnorm(tam[1] * tam[2]), ncol = tam[2])
d <- dist(m)

Error in dist(m) : cannot allocate vector of length 704982704
```

In my case, on a more modest machine, I can no longer compute the distances for a matrix of 100,000 rows and a single column. So what can be done? Strictly speaking, the two things I mentioned above; however, another approach is to work with a smaller set of data, that is, to extract a smaller, more manageable sample:

```
small_data <- data[sample(nrow(data), 1000), , drop = FALSE]  # random sample of 1,000 rows
fviz_nbclust(small_data, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
```
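If you go this route, a quick sanity check is to repeat the curve on a few independent samples and verify that the elbow appears at the same `k`. A minimal sketch, assuming your data frame is called `data` and that samples of 1,000 rows fit comfortably in memory (both are assumptions):

```
library(factoextra)  # provides fviz_nbclust() and loads ggplot2

set.seed(123)
elbow_plots <- lapply(1:3, function(i) {
  s <- data[sample(nrow(data), 1000), , drop = FALSE]  # independent sample of 1,000 rows
  fviz_nbclust(s, kmeans, method = "wss") +
    labs(subtitle = paste("Sample", i))
})
elbow_plots[[1]]  # inspect each curve; a consistent elbow suggests a stable choice of k
```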

Or, a more basic way: apply `kmeans` directly on the complete data with different values of `k` and evaluate in each case the total within-cluster sum of squares (the sum of squared distances from each observation to its centroid), which we can also plot, although the best `k` still has to be judged visually.

```
k.max <- 10
wss <- sapply(1:k.max,
              function(k) kmeans(data, k, nstart = 50, iter.max = 5)$tot.withinss)
wss
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
```
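This works because `kmeans()` never builds the pairwise distance matrix; it only keeps the data and the cluster centroids, so memory grows roughly linearly with the number of rows. A rough estimate of the footprint of your data matrix itself (using the 500,000 × 187 dimensions you mention and 8 bytes per double):

```
n_rows <- 500000
n_cols <- 187
n_rows * n_cols * 8 / 1024^3   # about 0.7 GB, comfortably below the ~8 GB memory.limit()
```

The limiting factor then becomes CPU time rather than memory; with half a million rows you may want to start with a smaller `nstart` and increase it once you have a candidate `k`.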