The problem
The packages you are trying to use work by executing code in multiple threads following a SIMD (Single Instruction, Multiple Data) approach. That is why they are appropriate when you want to apply the same function to each part of a data structure.
A typical example of use: apply the same function to each element of a list x and return the result as a list of the same length as x. With %dopar% the function is applied to several elements at the same time and (in some cases) the execution time is reduced.
The important thing is that the data are different (e.g. each column of a data.frame) but the function is the same.
In your case you are looking for the opposite: the same data, two different functions.
I do not think you will get the results you are looking for with the packages you are using, because they are meant for something else.
What interests you in your case is MISD (Multiple Instruction, Single Data).
Possible solutions
Note: parallel execution can shorten the execution time, but this is not always the case. Running in parallel carries an overhead: the threads have to be started and the results gathered afterwards. If the functions you are running do not take long, the overhead can outweigh the gain from parallelism, and in that case it is not worth parallelizing.
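To see this overhead for yourself, here is a minimal sketch (the data and the two-worker cluster are arbitrary choices for illustration) that times the same cheap operation serially and in parallel with the base parallel package:

```r
library(parallel)

x <- as.list(1:5000)
cl <- makeCluster(2)

# With a function as cheap as sqrt, the cost of shipping the data
# to the workers and collecting the results usually dominates the
# computation itself, so the parallel version is often slower.
t_serial   <- system.time(r_serial   <- lapply(x, sqrt))["elapsed"]
t_parallel <- system.time(r_parallel <- parLapply(cl, x, sqrt))["elapsed"]

stopCluster(cl)

c(serial = unname(t_serial), parallel = unname(t_parallel))
```

Both calls return the same result; only the elapsed times differ, and on most machines the parallel run will not win for work this small.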
Apply each function in parallel
One possible solution is not to use colMeans and rowMeans, and instead write a parallel iteration yourself. On Mac or Linux this is very easy with the mc*apply family of functions. On Windows, with foreach, you could do something like this:
foo <- data.frame(V1=rnorm(1000),V2=rnorm(1000),V3=rnorm(1000))
library(foreach)
library(doParallel)
# Register a cluster with two threads.
registerDoParallel(makeCluster(2))
foreach(col = seq_along(foo)) %dopar% mean(foo[,col])
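On Mac or Linux, a sketch of the equivalent with mclapply (which forks the current process, so it is not available on Windows) would be:

```r
library(parallel)

foo <- data.frame(V1 = rnorm(1000), V2 = rnorm(1000), V3 = rnorm(1000))

# mc.cores = 2 forks two worker processes; each one computes
# the mean of some of the columns.
res <- mclapply(seq_along(foo), function(col) mean(foo[, col]),
                mc.cores = 2)
res
```

The result is a list of three means, the same as the foreach version above.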
Here you are running the function mean() on each column using two threads. In this particular case, with only three columns, and knowing that colMeans is optimized while mean() is not, the result will surely be slower. You would still not be running the column means and the row means at the same time, but if instead of mean() you were working with a long-running function, you could use this strategy to speed up processing.
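As an aside, the same foreach machinery can also be made to express the MISD case directly, by iterating over a list of functions instead of over columns. This is a sketch, reusing the foo data.frame from above:

```r
library(foreach)
library(doParallel)

foo <- data.frame(V1 = rnorm(1000), V2 = rnorm(1000), V3 = rnorm(1000))

cl <- makeCluster(2)
registerDoParallel(cl)

# Same data, two different functions: each worker receives foo
# and applies one function from the list to all of it.
res <- foreach(f = list(colMeans, rowMeans)) %dopar% f(foo)

stopCluster(cl)
str(res)
```

Whether this pays off depends, again, on how long each function takes compared with the overhead of the cluster.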
Explore the future package
There is a package called future that introduces some new concepts in R, among them futures. It breaks the linearity of code execution and makes it possible to exploit parallelism in cases beyond those covered by parallel, the library underlying the SIMD approach in R.
The following function produces a list with two vectors, one from colMeans() and another from rowMeans(). By using the future({}) %plan% multiprocess scheme and then value(), each function should be executed in a separate thread.
library(future)
media_columnas_filas <- function(x) {
  # Each future should be evaluated in its own process.
  columnas <- future({colMeans(x)}) %plan% multiprocess
  filas <- future({rowMeans(x)}) %plan% multiprocess
  # value() blocks until each result is available.
  list(columnas_final = value(columnas),
       filas_final = value(filas))
}
In the tests I did, it ran a bit slower than the synchronous version:
list(columnas = colMeans(x),
     filas = rowMeans(x))
But you could try other (larger) data, other hardware or other functions. In fact, in some tests with more nested functions, it ran a bit faster than the synchronous version.