In R, split a dataframe by ranges of column values, apply operations iteratively, and combine the results

1

I have a very simplified version of a fairly large dataframe A (50 columns and 1,000,000 rows):

Palabra      Frecuencia  Numero
hola(1.4)    0.15        1
amigo(1.2)   0.67        2
sol(0.3)     0.85        7
hola(7.1)    0.4         3
hola(5.1)    0.44        4

I want to first split A into 4 subframes according to the values in the "Frecuencia" column, which range between 0 and 1, grouping them into 4 frequency baskets of width 0.25. This can be done with dplyr using group_by. Within each of the 4 subframes, I want to build sub-subframes containing all the rows whose "Palabra" column contains a specific word such as "hola". I know how to do this with filter and grepl. Then I want to run mathematical operations on these sub-subframes, such as:

op1 = mean(Numero)
op2 = length(Numero)  # number of rows; nrow() returns NULL on a single column
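
As an illustration of the steps just described, here is a minimal base R sketch for a single sub-subframe, using only the five sample rows from the table above. The name `sub_hola` and the choice of the second basket are assumptions made for the example, not part of the original data.

```r
# Sample rows copied from the table above.
A <- data.frame(
  Palabra    = c("hola(1.4)", "amigo(1.2)", "sol(0.3)", "hola(7.1)", "hola(5.1)"),
  Frecuencia = c(0.15, 0.67, 0.85, 0.4, 0.44),
  Numero     = c(1, 2, 7, 3, 4),
  stringsAsFactors = FALSE
)

# Assign each row to one of four baskets of width 0.25.
A$Canasta <- cut(A$Frecuencia, breaks = seq(0, 1, 0.25), include.lowest = TRUE)

# One sub-subframe (hypothetical name): rows in the second basket
# whose Palabra contains "hola".
sub_hola <- A[A$Canasta == "(0.25,0.5]" & grepl("hola", A$Palabra, fixed = TRUE), ]

op1 <- mean(sub_hola$Numero)  # 3.5
op2 <- nrow(sub_hola)         # 2
```

Note that cut labels the baskets as "[0,0.25]", "(0.25,0.5]", "(0.5,0.75]" and "(0.75,1]", so each frequency falls into exactly one basket.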

After joining op1 with op2 with cbind and performing a couple of simple operations, I generate new dataframes like this:

Palabra      op1    op2  Canasta
hola         0.42   2    [0.25, 0.5]

Finally (and this is the most important part), I want to repeat this process for each basket and each word, and build a new dataframe that appends the rows generated for the different baskets. The result would look something like this:

Palabra      op1    op2  Canasta
hola         0.21   3    [0, 0.25]
amigo        0.3    5    [0, 0.25]
sol          4.2    6    [0, 0.25]
hola         0.42   2    [0.25, 0.5] # this row corresponds to the sub-subframe example above
amigo        0.32   2    [0.25, 0.5]
sol          0.11   7    [0.25, 0.5]
hola         0.72   2    [0.5, 0.75] 
amigo        0.52   2    [0.5, 0.75]
sol          0.1    3    [0.5, 0.75]
hola         0.72   5    [0.75, 1] 
amigo        0.49   7    [0.75, 1]
sol          0.10   1    [0.75, 1]

I have seen that lapply can help with the iteration, but it is not clear to me how to apply it in this example. Maybe there is a more direct and easier way. What matters most to me is the final dataframe.

    
asked by Lucas 09.02.2017 at 00:55

1 answer

0

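If the operations you want to do always return a single value (a summary measure), you can get what you want with dplyr's summarize function. Because the values in Palabra carry a suffix such as "(1.4)", strip it first so that rows are grouped by the bare word.

library(dplyr)

tu_tabla %>% 
      # strip the "(1.4)"-style suffix so that rows group by the bare word
      mutate(Palabra = sub("\\(.*$", "", Palabra),
             Canasta = cut(Frecuencia, seq(0, 1, .25), include.lowest = TRUE, right = TRUE)) %>% 
      group_by(Canasta, Palabra) %>% 
      summarize(op1 = mean(Numero), op2 = n())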
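
Since the question asks about lapply, here is a rough base R sketch of the same split-apply-combine idea using split, lapply and rbind. It runs on the five sample rows from the question. The names `piezas`, `filas` and `resultado` are hypothetical, and it assumes the parenthesised suffix in Palabra is not part of the word.

```r
# Sample rows copied from the question.
A <- data.frame(
  Palabra    = c("hola(1.4)", "amigo(1.2)", "sol(0.3)", "hola(7.1)", "hola(5.1)"),
  Frecuencia = c(0.15, 0.67, 0.85, 0.4, 0.44),
  Numero     = c(1, 2, 7, 3, 4),
  stringsAsFactors = FALSE
)

# Assumption: everything from the opening parenthesis onward is not part of the word.
A$Palabra <- sub("\\(.*$", "", A$Palabra)

# Assign each row to one of four baskets of width 0.25.
A$Canasta <- cut(A$Frecuencia, breaks = seq(0, 1, 0.25), include.lowest = TRUE)

# One piece per (Canasta, Palabra) pair; drop = TRUE skips empty combinations.
piezas <- split(A, list(A$Canasta, A$Palabra), drop = TRUE)

# Summarise each piece as a one-row data frame.
filas <- lapply(piezas, function(d) {
  data.frame(
    Palabra = d$Palabra[1],
    op1     = mean(d$Numero),
    op2     = nrow(d),
    Canasta = as.character(d$Canasta[1]),
    stringsAsFactors = FALSE
  )
})

# Append the one-row results into the final data frame.
resultado <- do.call(rbind, filas)
rownames(resultado) <- NULL
```

On a table with a million rows the dplyr version is likely the more convenient choice; this sketch is mainly meant to show where lapply fits in.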
    
answered by 28.02.2017 at 00:30