Simplifying a lot, the answer would be:

- Install more physical memory; by the looks of it, you would need something over 1 TB.
- Rent a cloud service that offers a similar amount of memory (Amazon, for example).

Now, in a bit more detail: the problem is that the way `fviz_nbclust()` estimates the optimal number of clusters `k` requires first computing a distance matrix, whose size grows quadratically with the number of observations. You can easily calculate the number of values in this matrix:

```
n <- 500000
n*(n-1)/2
[1] 124999750000
```
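To put that count in memory terms (a back-of-the-envelope sketch, assuming each distance is stored as an 8-byte double, which is what `dist()` uses):

```
n <- 500000
n_dist <- n * (n - 1) / 2   # number of pairwise distances
gb <- n_dist * 8 / 2^30     # 8 bytes per double, converted to GiB
gb                          # roughly 931 GiB, hence the "more than 1 TB" above
```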

By reproducing your problem in a simpler way:

```
tam <- c(100000, 1) # To define a matrix of 100,000 rows and 1 column
m <- matrix(rnorm(tam[1]*tam[2]), ncol = tam[2])
d <- dist(m)
Error in dist(m) : cannot allocate vector of length 704982704
```

In my case, on a more modest machine, I can no longer compute the distances for a matrix of 100,000 rows and a single column. What can be done? Strictly speaking, the two things I mentioned at the start; however, another approach to the problem is to work with a smaller dataset, that is, to extract a smaller, more manageable sample:

```
small_data <- data[sample(nrow(data), 1000), , drop = FALSE]
fviz_nbclust(small_data, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
```
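As a sketch of the full workflow (using simulated data, since your actual `data` object isn't shown): pick `k` from the elbow plot on the subsample, then fit `kmeans()` on the complete data with that `k`, which never needs the distance matrix:

```
set.seed(123)
data <- matrix(rnorm(500000), ncol = 1)                      # stand-in for your full data
small_data <- data[sample(nrow(data), 1000), , drop = FALSE] # sample used for the elbow plot
# ... suppose the elbow on small_data suggests k = 3 ...
fit <- kmeans(data, centers = 3, nstart = 5)                 # clusters all 500,000 rows
fit$size                                                     # observations per cluster
```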

Or, a more basic approach would be to apply `kmeans` directly to the complete data with different values of `k` and, in each case, evaluate the total within-cluster sum of squared distances from each observation to its centroid. We could also plot these sums, although the best `k` still has to be judged visually (the "elbow"):

```
k.max <- 10
wss <- sapply(1:k.max,
              function(k) kmeans(data, k, nstart = 50, iter.max = 5)$tot.withinss)
wss
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
```