Run 2 functions in parallel on 1 data.frame


I've been reading about parallelizing processes in R and experimenting with the "parallel", "foreach" and "doParallel" packages, but I have not got the results I wanted. I would like to run 2 functions at the same time to reduce computing time, since I am only using one processor core and my code takes a long time to run.

For example, I have the following data frame:

data= data.frame(V1=rnorm(1000),V2=rnorm(1000),V3=rnorm(1000))

And I want to run the following two functions in parallel:

colMeans(data)
colSums(data)

The best approach (I think) would be for each core to handle one function, but I do not know whether parallel processing always has to work that way.

Thank you very much for the help.

    
asked by Void 16.08.2018 at 10:39

1 answer


The problem

The packages you are trying out run code on several workers at once in a SIMD style (Single Instruction, Multiple Data). That is why they fit when you apply the same function to each part of a data structure. A typical use: apply the same function to each element of a list x and return the results as a list of the same length as x. With %dopar%, the function is applied to more than one element at the same time and, in some cases, the execution time goes down.

The important thing is that the data differ (e.g. each column of a data.frame) while the function stays the same.
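As a minimal sketch of that "same function, different data" pattern, here is a version using parLapply() from the base parallel package. The list x and its contents are made up for illustration, and two workers is an arbitrary choice.

library(parallel)

# Hypothetical input: a list of numeric vectors.
x <- list(a = 1:10, b = 11:20, c = 21:30)

cl <- makeCluster(2)            # start two worker processes
res <- parLapply(cl, x, sum)    # same function, a different element each time
stopCluster(cl)                 # release the workers

str(res)                        # a list with the same length as x

The result has one entry per element of x, which is exactly the shape described above.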

In your case you are looking for the opposite: the same data and two different functions.

I do not think you will get the results you want with the packages you are using, because they are meant for something else.

What you are after is MISD (Multiple Instruction, Single Data).

Possible solutions

  

Note: parallel execution can shorten the run time, but not always. Running in parallel carries overhead: the workers have to be started, the data sent to them, and the results gathered. If the functions you run are fast, the overhead can outweigh the gain from parallelism, and then it is not worth parallelizing.
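One way to check whether the overhead pays off is simply to time both versions on your own data. The sketch below uses system.time() with the base parallel package. The data frame, the two workers and the variable names are assumptions for illustration, and the timings will depend on your machine.

library(parallel)

foo <- data.frame(V1 = rnorm(1000), V2 = rnorm(1000), V3 = rnorm(1000))

cl <- makeCluster(2)

# Sequential: the optimized built-in.
t_seq <- system.time(seq_res <- colMeans(foo))

# Parallel: one column per task. The timing includes sending each column
# to a worker and collecting the results.
t_par <- system.time(par_res <- unlist(parLapply(cl, as.list(foo), mean)))

stopCluster(cl)

# Elapsed seconds for each approach.
c(sequential = t_seq[["elapsed"]], parallel = t_par[["elapsed"]])

Both versions return the same column means, so the only thing to compare is the elapsed time.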

Apply each function in parallel

One possible solution is to skip the vectorized col*/row* functions and iterate over the columns in parallel instead. On Mac or Linux this is very easy with the mc*apply family of functions. On Windows, with foreach, you could do something like this:

foo <- data.frame(V1=rnorm(1000),V2=rnorm(1000),V3=rnorm(1000))

library(foreach)
library(doParallel)

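# Register a cluster with two workers.
cl <- makeCluster(2)
registerDoParallel(cl)

# Compute the mean of each column, one column per task.
foreach(col = seq_along(foo)) %dopar% mean(foo[, col])

# Shut the workers down when finished.
stopCluster(cl)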

Here mean() runs on each column across two workers. In this particular case, with only three columns and given that colMeans is optimized and mean() is not, the result will most likely be slower.

You still would not be running the column means and the row means at the same time, but if instead of mean() you have a function with a long run time, this strategy could speed up processing.
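For Mac or Linux, the mc*apply route mentioned above could look roughly like this with mclapply() from the base parallel package. The fallback to one core on Windows and the name n_cores are my own additions for illustration.

library(parallel)

foo <- data.frame(V1 = rnorm(1000), V2 = rnorm(1000), V3 = rnorm(1000))

# mclapply() relies on forking, which Windows does not support,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# A data.frame is a list of columns, so mean() is applied to each column.
medias <- mclapply(foo, mean, mc.cores = n_cores)
unlist(medias)

No cluster has to be created or stopped here, which is why this family is convenient where forking is available.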

Explore the package future

There is a package called future that introduces some new concepts to R, including futures. It breaks the linear execution of the code and makes it possible to exploit parallelism in cases beyond those covered by parallel, the library underlying the SIMD approach in R.

The function below returns a list with two vectors, one from colMeans() and one from rowMeans(). With the future({}) %plan% multiprocess scheme followed by value(), each function should run in a separate process.

library(future)
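media_columnas_filas <- function(x) {
  # Launch both computations as futures so they can run concurrently.
  columnas <- future({colMeans(x)}) %plan% multiprocess
  filas <- future({rowMeans(x)}) %plan% multiprocess
  # value() waits for each future and returns its result.
  list(columnas_final = value(columnas),
       filas_final = value(filas))
}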

In the tests I did, it runs a bit slower than with

list(columnas = colMeans(x),
     filas = rowMeans(x))

But you could try other (larger) data, other hardware or other functions. In fact, in some tests with more nested functions, it runs a bit faster than the synchronous version.
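As far as I know, more recent releases of future have retired the multiprocess plan in favour of multisession (and multicore on Unix-like systems). Under that assumption, a variant that sets the plan once with plan() might look like this. The two workers and the variable names are illustrative choices, not something from the tests above.

library(future)

# Assumption: two background R sessions, available on Windows, Mac and Linux.
plan(multisession, workers = 2)

foo <- data.frame(V1 = rnorm(1000), V2 = rnorm(1000), V3 = rnorm(1000))

# Each future() starts evaluating in a background worker without blocking.
columnas <- future(colMeans(foo))
filas <- future(rowMeans(foo))

# value() blocks until the corresponding result is ready.
res <- list(columnas = value(columnas), filas = value(filas))

# Go back to sequential evaluation, which also shuts the workers down.
plan(sequential)

The same overhead caveat applies: with functions as cheap as these, the synchronous version may well be faster.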

    
answered by 16.08.2018 / 19:25