# Sum of a category within a data.frame

From a data.frame with the following characteristics:

``````df <- data.frame(
  mes   = rep(10:12, each = 4),
  hour  = rep(1:2, 6),
  value = sample(1:12)  # random permutation, so the values change on every run
)

> df
mes hour value
1   10    1     6
2   10    2     9
3   10    1    10
4   10    2    12
5   11    1     7
6   11    2     2
7   11    1    11
8   11    2     3
9   12    1     5
10  12    2     1
11  12    1     8
12  12    2     4
``````

I would like to generate another data.frame with the sum of `value` for each hour within each month, something like this:

``````>df2
mes hour value
1   10    1    16
2   10    2    21
3   11    1    18
4   11    2     5
5   12    1    13
6   12    2     5
``````

Any ideas on how I can achieve it?

asked by Edgar Mora 30.11.2017 at 23:57

For this kind of situation, the `dplyr` package makes the transformation you need very easy. First define the groups (here, each combination of month and hour), then summarise them, in this case with the sum of `value` for each of those groups. I use pipes to chain the functions, but you could write it another way.

``````library(tidyverse)      # This package in turn bundles 'dplyr' and other packages that are very useful for data manipulation.

tribble(
~mes, ~hour, ~value,
10,    1,     6,
10,    2,     9,
10,    1,    10,
10,    2,    12,
11,    1,     7,
11,    2,     2,
11,    1,    11,
11,    2,     3,
12,    1,     5,
12,    2,     1,
12,    1,     8,
12,    2,     4) -> datos     # Create the data structure and assign it the name datos.
``````

With this I already have the data object in my environment.

``````datos %>%                        # The pipe passes the output of the left-hand side (here, the object datos) to the function on the right-hand side.
  group_by(mes, hour) %>%        # Group by mes and hour. The output of this step shows the data unchanged, but it carries group attributes that the next function uses.
  summarise(suma = sum(value))   # Build a summary with the sum of value, computed separately for each of the groups defined above.
``````

and I get:

``````# A tibble: 6 x 3
# Groups:   mes [?]
mes  hour  suma
<dbl> <dbl> <dbl>
10     1    16
10     2    21
11     1    18
11     2     5
12     1    13
12     2     5
``````

After the last function you can use the rightward assignment operator `->` to name your output, if that is what you need, or the traditional assignment operator `<-` at the beginning of the whole chain. Either way the result is the same: a tibble, which is also a `data.frame`. This matters because, if you want to keep operating on it, you have a very manageable data structure.
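As a quick check of that claim, here is a sketch that stores the same chain with both assignment styles and compares the results. It assumes `dplyr` is installed; `res_left` and `res_right` are hypothetical names, and the data frame is rebuilt from the values printed in the question:

```r
library(dplyr)

# Same values as the data frame printed in the question
datos <- data.frame(
  mes   = rep(10:12, each = 4),
  hour  = rep(1:2, 6),
  value = c(6, 9, 10, 12, 7, 2, 11, 3, 5, 1, 8, 4)
)

# Leftward assignment at the start of the chain
res_left <- datos %>%
  group_by(mes, hour) %>%
  summarise(suma = sum(value))

# Rightward assignment at the end of the chain
datos %>%
  group_by(mes, hour) %>%
  summarise(suma = sum(value)) -> res_right

# Both objects hold the same data
identical(as.data.frame(res_left), as.data.frame(res_right))

# The result can be used wherever a data.frame is expected
inherits(res_left, "data.frame")
```

Which style to use is a matter of taste; `<-` at the start is the more common convention in R code.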

# Other uses of `summarise()`

With `group_by()` and `summarise()` it is very easy to summarise data, provided it has the right structure. For example, to get the average value in each month:

``````datos %>%
group_by(mes) %>%
summarise(promedio=mean(value))
``````

Or to find the highest value for each hour in each month:

``````datos %>%
group_by(mes, hour) %>%
summarise(maximo=max(value))
``````
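Several summaries can also be computed in a single `summarise()` call. The following is a sketch under the same assumptions as before (`dplyr` installed, data rebuilt from the question's printed values); the column names `n`, `suma`, `promedio` and `maximo` and the object name `resumen` are just illustrative:

```r
library(dplyr)

# Same values as the data frame printed in the question
datos <- data.frame(
  mes   = rep(10:12, each = 4),
  hour  = rep(1:2, 6),
  value = c(6, 9, 10, 12, 7, 2, 11, 3, 5, 1, 8, 4)
)

resumen <- datos %>%
  group_by(mes) %>%
  summarise(
    n        = n(),          # number of rows in each month
    suma     = sum(value),   # total per month
    promedio = mean(value),  # average per month
    maximo   = max(value)    # highest value per month
  )

resumen
```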

As an alternative to mpaladino's good answer, I can't help pointing out the base R solution:

``````aggregate(value~mes+hour, data=df, FUN=sum)
``````

Explanation:

• `aggregate()` is base R's function for building groups from a `data.frame` and applying a function of your choice to each group
• It can be invoked in several ways; this one, using the formula `value~mes+hour`, is convenient because it returns a `data.frame` similar to the one you are looking for
• To each group we apply the function `sum()`
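To illustrate why the formula form is convenient, here is a small sketch comparing it with the `by =` form of `aggregate()`. The data frame uses the values printed in the question, and `by_formula` and `by_list` are hypothetical names:

```r
# Same values as the data frame printed in the question
df <- data.frame(
  mes   = rep(10:12, each = 4),
  hour  = rep(1:2, 6),
  value = c(6, 9, 10, 12, 7, 2, 11, 3, 5, 1, 8, 4)
)

# Formula interface: the summed column keeps its original name
by_formula <- aggregate(value ~ mes + hour, data = df, FUN = sum)

# 'by' interface: grouping columns are named after the list elements,
# and the summed column is called x
by_list <- aggregate(df$value, by = list(mes = df$mes, hour = df$hour), FUN = sum)

names(by_formula)  # "mes" "hour" "value"
names(by_list)     # "mes" "hour" "x"
```

Both calls compute the same sums; the formula version simply needs no renaming afterwards.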

The only issue I see is that, by default, the output is not in the order you would expect. We could sort the result through an intermediate object or, even easier, tweak the call to `aggregate()` a bit (the values below differ from those in the question because `sample()` produces a new permutation on each run):

``````> df
mes hour value
1   10    1     8
2   10    2     5
3   10    1     4
4   10    2    11
5   11    1     2
6   11    2     3
7   11    1    12
8   11    2    10
9   12    1     6
10  12    2     7
11  12    1     9
12  12    2     1

> aggregate(value~hour+mes, data=df, FUN=sum)[,c(2,1,3)]
mes hour value
1  10    1    12
2  10    2    16
3  11    1    14
4  11    2    13
5  12    1    15
6  12    2     8
``````
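The other option mentioned above, sorting through an intermediate object, might look like the sketch below. It uses the values printed in the question so the result can be compared with the expected `df2`; the name `res` is hypothetical:

```r
# Same values as the data frame printed in the question
df <- data.frame(
  mes   = rep(10:12, each = 4),
  hour  = rep(1:2, 6),
  value = c(6, 9, 10, 12, 7, 2, 11, 3, 5, 1, 8, 4)
)

# Intermediate object: rows come out with mes varying fastest within hour
res <- aggregate(value ~ mes + hour, data = df, FUN = sum)

# Sort by month, then by hour, and reset the row names
res <- res[order(res$mes, res$hour), ]
rownames(res) <- NULL

res
```

This keeps the original `value ~ mes + hour` formula and column order, at the cost of one extra step.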