Compute the mean and plot in R from a CSV whose values are lists

I have a data file that, in reduced form (the original contains more rounds and more levels of "Contenido"), looks roughly like this (data):

Ronda  Contenido  Salida
1        0        [1,1,1,1]
1        0        [1,1,1,1]
1        0.1      [1,1,1,1]
1        0.1      [1,1,1,1]
1        0.2      [1,1,1,1]
1        0.2      [1,1,1,1]
2        0        [1,0,2,1]
2        0        [2,0,1,1]
2        0.1      [2,1,1,0]
2        0.1      [2,1,1,0]
2        0.2      [3,1,0,0]
2        0.2      [4,0,0,0]

I would like to (learn how to) compute the mean and the 95% confidence interval of the first value of the list in the "Salida" column, for each combination of Ronda and Contenido. For the example above, the means would be:

Ronda    Contenido   Media Salida
1        0           1
1        0.1         1
1        0.2         1
2        0           1.5
2        0.1         2
2        0.2         3.5

The idea is to produce a plot with "Ronda" on the x axis and the mean of "Salida" with its 95% CI on the y axis, with one line per value of "Contenido". I had thought of using code similar to the following, but of course the "Salida" column needs different handling:

y <- datos$Salida
z <- datos$Contenido
g <- datos$Ronda
z <- factor(z, levels = c(0, 0.1, 0.2))
data <- data.frame(y, z, g)
library(tidyverse)   # Loads dplyr and ggplot2.
library(scales)
data %>%
  group_by(g, z) %>%
  summarise(media = mean(y),                            # Estimate of the mean.
            desvio = sd(y),                             # Standard deviation.
            error_est = desvio / sqrt(n()),             # Standard error of the mean.
            intervalo_sup = media + (2*error_est),      # Upper bound of the interval.
            intervalo_inf = media - (2*error_est)) %>%  # Lower bound of the (approximate) 95% interval.
  ggplot(aes(x = g, y = media, color = z)) +
  labs(title = mytitle1) +                              # mytitle1 is assumed to be defined elsewhere.
  geom_point() +                                        # So that something is drawn when there is only one data point.
  geom_line(aes(group = z), size = 1) +                 # Lines joining the points of each group.
  geom_errorbar(aes(ymax = intervalo_sup,               # 95% interval for each point.
                    ymin = intervalo_inf),
                width = 0.3) +
  #theme_minimal() +
  labs(x = "Round", y = "Mean+CI", color = "Model") +
  scale_color_manual(labels = c("0", "0.1", "0.2"), values = c("blue", "red", "purple")) +
  theme(legend.position = "bottom", legend.text = element_text(size = 12)) +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 14))

Note: The Salida column looks like this in the data frame because it holds lists of integers created with Python.

Thanks for the help.

    
asked by pyring 12.09.2018 at 20:06

2 answers

Your CSV file is not very friendly to use directly, since the Salida column appears to hold a list of values. The first thing we can do is import it and study that column in more detail:

data <- read.csv("C:/Tmp/data.csv", stringsAsFactors=FALSE)

# Count the number of commas in each value of Salida
n <- unique(lengths(regmatches(data$Salida, gregexpr(",",  data$Salida)))) + 1
n

[1] 8

We imported the file as is, then counted the number of commas in each observation of Salida to see whether the column is heterogeneous in its number of values. Since the result is a single number, every observation of Salida has 8 values, which paves the way for "expanding" the comma-separated strings into real columns.
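As a minimal sketch of that check on a made-up vector (the strings in `x` are hypothetical, not the asker's data): counting the commas in each string and adding one gives the number of values, and more than one distinct result would mean the rows differ in length.

```r
# Hypothetical strings in the same "[a,b,c]" format as the Salida column
x <- c("[1,1,1,1]", "[2,0,1,1]", "[3,1,0]")

# gregexpr() locates every comma, regmatches() extracts them,
# and lengths() counts how many were found in each string
n_values <- lengths(regmatches(x, gregexpr(",", x, fixed = TRUE))) + 1

n_values          # number of values in each string
unique(n_values)  # more than one result means the column is not homogeneous
```

If `unique()` returns several numbers, the fixed-width matrix approach would not apply directly and the rows would need to be handled one by one.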

# First remove both brackets (the backslashes must be doubled inside an R string)
data$Salida <- gsub('\\[|\\]', '', data$Salida)

# Split the 8 values of Salida into a matrix of nrow(data) x 8
m <- matrix(as.integer(unlist(strsplit(data$Salida, ","))), byrow = TRUE, ncol = n)
colnames(m) <- paste0("V", 1:n)

# Bind the matrix to the original data.frame
data <- cbind(data, m)
data <- data[, -4] # Drop the Salida column (the 4th column)

head(data)

  X Ronda Contenido V1 V2 V3 V4 V5 V6 V7 V8
1 1     1         0  1  1  1  1  1  1  1  1
2 2     1         0  1  1  1  1  1  1  1  1
3 3     1         0  1  1  1  1  1  1  1  1
4 4     2         0  2  1  2  0  1  0  1  1
5 5     2         0  1  1  0  2  2  0  0  2
6 6     2         0  1  1  1  1  2  0  1  1

As you can see, we have turned the Salida column into 8 columns named V1..V8; from here, using the first or any other is trivial.
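For instance, once the first value lives in its own column, the mean per Ronda and Contenido can be obtained with base R's `aggregate()`. This is only a sketch on a toy data frame (`d` and its values are made up for illustration), not the asker's file:

```r
# Toy data frame mimicking the expanded layout (values are made up)
d <- data.frame(
  Ronda     = c(1, 1, 2, 2, 2, 2),
  Contenido = c(0, 0, 0, 0, 0.1, 0.1),
  V1        = c(1, 1, 1, 2, 2, 2)
)

# Mean of the first value for each Ronda/Contenido combination
medias <- aggregate(V1 ~ Ronda + Contenido, data = d, FUN = mean)
medias
```

The same call with `FUN = sd` or `FUN = length` gives the pieces needed for a standard error.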

All of the above uses base R, but with tidyverse it becomes much simpler and clearer:

library(tidyverse)

data %>%
    mutate(Salida = gsub('\\[|\\]', '', Salida)) %>%           # Remove the brackets
    separate(Salida, into = paste0("V", 1:8), convert = TRUE)  # Split Salida into integer columns

NOTE: All this applies if you want to solve it in R. In truth, in this example, where the number of values is always fixed, you could use any other find-and-replace tool to strip the brackets [], which would leave you with a plain CSV file that imports without trouble.

    
answered by 13.09.2018 / 16:18

Although it is no longer relevant now that the actual data is available, I am keeping this answer because it addresses a different problem: extracting elements from a list stored in a data.frame.

Does your data look something like this?

foo <- data.frame(Ronda = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                  Contenido = c(0, 0, 0.1, 0.1, 0.2, 0.2, 0, 0, 0.1, 0.1, 0.2, 0.2))
# Salida is a list column: one numeric vector per row
foo$Salida <- list(c(1,1,1,1), c(1,1,1,1), c(1,1,1,1), c(1,1,1,1), c(1,1,1,1), c(1,1,1,1),
                   c(1,0,2,1), c(2,0,1,1), c(2,1,1,0), c(2,1,1,0), c(3,1,0,0), c(4,0,0,0))

So:

str(foo)
'data.frame':   12 obs. of  3 variables:
 $ Ronda    : num  1 1 1 1 1 1 2 2 2 2 ...
 $ Contenido: num  0 0 0.1 0.1 0.2 0.2 0 0 0.1 0.1 ...
 $ Salida   :List of 12
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 0 2 1
  ..$ : num  2 0 1 1
  ..$ : num  2 1 1 0
  ..$ : num  2 1 1 0
  ..$ : num  3 1 0 0
  ..$ : num  4 0 0 0

It is very important that you follow @Patricio Moracho's recommendation and upload a sample of your data. If Salida is a list of lists rather than a list of vectors (as in foo), this solution might not work. If your data does have this structure, you can create a new column with the first element of each entry of Salida with:

foo$salida1 <- sapply(foo$Salida, '[', 1)
  

sapply() applies a function to each element of a list, and foo$Salida is a list: one that lives inside a data.frame, but a list nonetheless. sapply() also simplifies its output, converting it into a vector. '[' is the subsetting function that extracts values by position, in this case the value at index 1.
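A small sketch of the same idea on a stand-alone list (the object `salida` and its values are made up for illustration). It also shows `vapply()`, a stricter base R alternative that raises an error unless every result is a single number, and `[[`, which also returns a plain number when each element is itself a list rather than a vector:

```r
# Hypothetical list, built the same way as foo$Salida
salida <- list(c(1, 0, 2, 1), c(2, 0, 1, 1), c(3, 1, 0, 0))

# sapply() calls `[`(element, 1) on each element and simplifies to a vector
primeros <- sapply(salida, "[", 1)

# vapply() does the same, but checks that each result is exactly one number
primeros_seguro <- vapply(salida, function(v) v[[1]], numeric(1))

# The same pattern extracts any other position, e.g. the third value
terceros <- sapply(salida, "[", 3)
```

Changing the trailing index is all it takes to analyse a different position of the list.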

With this data you could do:

library(tidyverse)

foo %>%
  group_by(Ronda, Contenido) %>%
  summarise(media = mean(salida1),                      # Mean of the first value
            desvio = sd(salida1),                       # Standard deviation
            error_est = desvio / sqrt(n()),             # Standard error of the mean
            intervalo_sup = media + (2*error_est),      # Upper bound (approx. 95%)
            intervalo_inf = media - (2*error_est)) %>%  # Lower bound (approx. 95%)
  ungroup() %>%
  mutate(Ronda = as.factor(Ronda), Contenido = as.factor(Contenido)) %>%
  ggplot(aes(x = Ronda, y = media, color = Contenido)) +
  geom_point(position = position_dodge(0.3)) +
  geom_errorbar(aes(ymax = intervalo_sup,
                    ymin = intervalo_inf),
                position = position_dodge(0.3),
                width = 0.3)

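This produces a plot with Ronda on the x axis and the mean on the y axis, with one colored point and error bar per value of Contenido.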
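The summarise() call above uses two standard errors as a rough 95% interval, which is close to the normal quantile of 1.96. With only a few observations per group, a Student t quantile gives a wider interval. The following is a sketch under the assumption that the values in each group are roughly normal; `ic95` is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: mean and 95% CI based on Student's t distribution
ic95 <- function(x, nivel = 0.95) {
  n <- length(x)
  media <- mean(x)
  error_est <- sd(x) / sqrt(n)                   # standard error of the mean
  t_crit <- qt(1 - (1 - nivel) / 2, df = n - 1)  # two-sided critical value
  c(media = media,
    inf = media - t_crit * error_est,
    sup = media + t_crit * error_est)
}

ic95(c(1, 2, 3, 2, 1, 3, 2, 2))  # made-up sample of eight values
```

Inside summarise() the fixed factor 2 could be replaced by `qt(0.975, df = n() - 1)`. Groups with a single observation have no standard deviation, so their interval would be NA either way.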

    
answered by 13.09.2018 at 01:19