Simplifying a lot, the answer would be:
- Install more physical memory; as you say, you would need something over 1 TB
- Rent a cloud service that offers a similar amount of memory (Amazon, for example)
Going into a bit more detail, the problem is that, to estimate the optimal number of clusters k,
fviz_nbclust()
first calculates a distance matrix, and the number of pairwise distances grows quadratically with the number of observations. You can easily calculate how many values that matrix contains:
n <- 500000
n*(n-1)/2
[1] 124999750000
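Each of those values is stored as a double-precision number (8 bytes), so, as a rough estimate, the distance matrix alone would need on the order of the 1 TB mentioned above:
n * (n - 1) / 2 * 8   # bytes needed: roughly 1e12, i.e. about 1 TB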
We can reproduce your problem in a simpler way:
tam <- c(100000, 1) # to define a matrix of 100,000 rows and 1 column
m <- matrix(rnorm(tam[1]*tam[2]), ncol = tam[2])
d <- dist(m)
Error in dist(m) : cannot allocate vector of length 704982704
In my case, with a more modest machine, I can no longer compute the distances for a matrix of 100,000 rows and a single column. What can be done? Strictly speaking, the two things I mentioned at the beginning; however, another way to approach the problem is to work with a smaller data set, that is, to extract a smaller, more manageable sample:
library(factoextra)

small_data <- data[sample(nrow(data), 1000), , drop = FALSE]
fviz_nbclust(small_data, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
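Since the sample is random, it may be worth drawing it a few times and checking that the suggested k does not change; a minimal sketch, assuming (as above) that your full data frame is called data:
set.seed(123)                 # for reproducibility of the samples
for (i in 1:3) {
  s <- data[sample(nrow(data), 1000), , drop = FALSE]
  print(fviz_nbclust(s, kmeans, method = "wss"))   # one elbow plot per sample
}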
Or, a more basic way would be to apply
kmeans
directly on the complete data with different values of k and, for each one, evaluate the total within-cluster sum of squares (the sum of squared distances from each observation to its centroid). We can plot this as well, although the optimal k
still has to be chosen visually (the classic "elbow").
k.max <- 10

# total within-cluster sum of squares for k = 1, ..., k.max
wss <- sapply(1:k.max,
              function(k) kmeans(data, k, nstart = 50, iter.max = 5)$tot.withinss)
wss

plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
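Once you have chosen a value of k from the plot (the 3 used earlier in geom_vline() is just illustrative), you would fit the final model once on the complete data; a minimal sketch:
k.opt <- 3   # hypothetical value read off the elbow plot
final <- kmeans(data, centers = k.opt, nstart = 50, iter.max = 10)
final$size           # number of observations in each cluster
final$tot.withinss   # total within-cluster sum of squares for the chosen k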