Alternative to nested lapply

I am posting this question so that there is a reference, in Spanish, of the available options.

Suppose we have a data set like the following:

set.seed(2018)
mi_df <- data.frame(stringsAsFactors = FALSE,
                    var1 = rnorm(50),
                    grupo1 = sample(c("a", "b", "c"), 50, TRUE),
                    grupo2 = sample(c("x", "y", "z"), 50, TRUE))

The goal is to obtain the mean of var1, using mean, for each combination of grupo1 and grupo2 (ax, bx, cx, ay, by, etc.).

One way is to nest lapply, so that mean is applied to each combination. The outer lapply iterates over the values of grupo1 and the inner, nested one iterates over the values of grupo2:

lapply(unique(mi_df[["grupo1"]]), function(x) {
  lapply(unique(mi_df[["grupo2"]]), function(y) {
    subconjunto <- subset(mi_df, grupo1 == x & grupo2 == y)
    mean(subconjunto[["var1"]])
  })
})

The result is a list with nine results (actually, a list of three lists, with three results inside each one).

The above can become confusing with more complex problems.

How could I get the same nine results without nesting lapply?

r

asked by Juan Bosco 02.02.2018 at 17:34

2 answers

score 1

I think the simplest way is to group with aggregate, as follows:

aggregate(var1 ~ grupo1 + grupo2, data = mi_df, FUN = mean)

This groups the rows by grupo1 and grupo2 and applies mean() to var1 within each group. The output:

  grupo1 grupo2         var1
1      a      x -0.191475371
2      b      x  0.738641761
3      c      x  0.154478780
4      a      y -0.001940531
5      b      y -0.598257655
6      c      y  1.507863033
7      a      z -0.045557242
8      b      z -0.404402472
9      c      z -0.001638198
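
For comparison, here is a minimal base R sketch of two other ways to avoid the nesting, in case a matrix or a list is more convenient than a data frame. The object names medias_matriz, grupos and medias_lista are only illustrative. tapply() returns the means as a matrix, while split() followed by a single lapply() returns a named list (with names such as "a.x"), which is closer to what the nested lapply produces.

```r
set.seed(2018)
mi_df <- data.frame(stringsAsFactors = FALSE,
                    var1 = rnorm(50),
                    grupo1 = sample(c("a", "b", "c"), 50, TRUE),
                    grupo2 = sample(c("x", "y", "z"), 50, TRUE))

# tapply(): a single call; the result is a matrix with
# grupo1 in the rows and grupo2 in the columns
medias_matriz <- tapply(mi_df$var1, mi_df[c("grupo1", "grupo2")], mean)

# split() cuts var1 into one vector per grupo1/grupo2 combination,
# so one non-nested lapply() is enough to get the mean of each piece
grupos <- split(mi_df$var1, mi_df[c("grupo1", "grupo2")])
medias_lista <- lapply(grupos, mean)

medias_matriz
str(medias_lista)
```

Note that the exact values depend on the R version, because the default behaviour of sample() changed in R 3.6.0, so they may not match the table above.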

answered 02.02.2018 at 17:51

score 2 · Accepted Answer

An alternative using tidyverse.

set.seed(2018)
mi_df <- data.frame(stringsAsFactors = FALSE,
                    var1 = rnorm(50),
                    grupo1 = sample(c("a", "b", "c"), 50, TRUE),
                    grupo2 = sample(c("x", "y", "z"), 50, TRUE))

library(tidyverse)

mi_df %>% group_by(grupo1, grupo2) %>% summarise(var1 = mean(var1))
#> # A tibble: 9 x 3
#> # Groups:   grupo1 [?]
#>   grupo1 grupo2     var1
#>   <chr>  <chr>     <dbl>
#> 1 a      x      -0.191  
#> 2 a      y      -0.00194
#> 3 a      z      -0.0456 
#> 4 b      x       0.739  
#> 5 b      y      -0.598  
#> 6 b      z      -0.404  
#> 7 c      x       0.154  
#> 8 c      y       1.51   
#> 9 c      z      -0.00164

Why take this route? Imagine you have more variables to summarise and more variables to group by, and you want not only the mean but also the variance (and other statistics). You can do it like this:

library(tidyverse)
data(mtcars)
mtcars2 <- select(mtcars, cyl, vs, am, mpg, hp, wt)
head(mtcars2)
#>                   cyl vs am  mpg  hp    wt
#> Mazda RX4           6  0  1 21.0 110 2.620
#> Mazda RX4 Wag       6  0  1 21.0 110 2.875
#> Datsun 710          4  1  1 22.8  93 2.320
#> Hornet 4 Drive      6  1  0 21.4 110 3.215
#> Hornet Sportabout   8  0  0 18.7 175 3.440
#> Valiant             6  1  0 18.1 105 3.460

mtcars2 %>%
  group_by(cyl, vs, am) %>%
  summarise_all(.funs = list(media = mean, varianza = var, minimo = min))
#> # A tibble: 7 x 12
#> # Groups:   cyl, vs [?]
#>     cyl    vs    am mpg_media hp_media wt_media mpg_varianza hp_varianza
#>   <dbl> <dbl> <dbl>     <dbl>    <dbl>    <dbl>        <dbl>       <dbl>
#> 1  4.00  0     1.00      26.0     91.0     2.14       NA            NA  
#> 2  4.00  1.00  0         22.9     84.7     2.94        2.11        386  
#> 3  4.00  1.00  1.00      28.4     80.6     2.03       22.6         583  
#> 4  6.00  0     1.00      20.6    132       2.76        0.563      1408  
#> 5  6.00  1.00  0         19.1    115       3.39        2.66         84.2
#> 6  8.00  0     0         15.0    194       4.10        7.70       1113  
#> 7  8.00  0     1.00      15.4    300       3.37        0.320      2520  
#> # ... with 4 more variables: wt_varianza <dbl>, mpg_minimo <dbl>,
#> #   hp_minimo <dbl>, wt_minimo <dbl>
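
In more recent versions of dplyr (1.0.0 and later), summarise_all() is marked as superseded in favour of across(). The following is a sketch of what the same summary could look like with across(), assuming dplyr 1.0.0 or later is installed; the name resumen is only illustrative. With a named list of functions, the output columns are named as column_function (for example mpg_media), matching the names above.

```r
library(dplyr)

mtcars2 <- select(mtcars, cyl, vs, am, mpg, hp, wt)

resumen <- mtcars2 %>%
  group_by(cyl, vs, am) %>%
  summarise(
    # everything() selects all non-grouping columns (mpg, hp, wt)
    across(everything(),
           list(media = mean, varianza = var, minimo = min)),
    # drop the grouping so the result is a plain, ungrouped tibble
    .groups = "drop"
  )

resumen
```

The variance is NA for any group that contains a single car, as in the first row of the output above.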