I have a data file that, once minimized (the original contains more rounds and more levels of "Contenido"), has roughly the following structure (data):

Ronda  Contenido  Salida
1        0        [1,1,1,1]
1        0        [1,1,1,1]
1        0.1      [1,1,1,1]
1        0.1      [1,1,1,1]
1        0.2      [1,1,1,1]
1        0.2      [1,1,1,1]
2        0        [1,0,2,1]
2        0        [2,0,1,1]
2        0.1      [2,1,1,0]
2        0.1      [2,1,1,0]
2        0.2      [3,1,0,0]
2        0.2      [4,0,0,0]

I would like to (learn how to) obtain the mean and the 95% confidence interval of the first value of the list in the "Salida" column, for each combination of Ronda and Contenido. For the example above we would get something like this:

Ronda    Contenido   Media Salida
1        0           1
1        0.1         1
1        0.2         1
2        0           1.5
2        0.1         2
2        0.2         3.5

The idea is to generate a plot with "Ronda" on the x axis and the mean of "Salida" with its 95% CI on the y axis, with one series per value of "Contenido". I had thought of using code similar to the following, but of course the "Salida" column needs to be handled differently:

library(dplyr)
library(ggplot2)

y <- datos$Salida
z <- datos$Contenido
g <- datos$Ronda
z <- factor(z, levels = c(0, 0.1, 0.2))
data <- data.frame(y, z, g)
data %>%
  group_by(g, z) %>%
  summarise(media = mean(y),                            # Estimate of the mean
            desvio = sd(y),                             # Standard deviation
            error_est = desvio / sqrt(n()),             # Standard error of the estimated mean
            intervalo_sup = media + (2*error_est),      # Upper bound of the interval
            intervalo_inf = media - (2*error_est)) %>%  # Lower bound of the (approximate) 95% interval
  ggplot(aes(x = g, y = media, color = z)) +
  labs(title = mytitle1) +                              # mytitle1 is assumed to be defined elsewhere
  geom_point() +                                        # So a plot is still drawn when there is a single data point
  geom_line(aes(group = z), size = 1) +                 # Lines joining the points of each group
  geom_errorbar(aes(ymax = intervalo_sup,               # Approximate 95% interval for each point
                    ymin = intervalo_inf),
                width = 0.3) +
  #theme_minimal() +
  labs(x = "Round", y = "Mean+CI", color = "Model") +
  scale_color_manual(labels = c("0", "0.1", "0.2"), values = c("blue", "red", "purple")) +
  theme(legend.position = "bottom", legend.text = element_text(size = 12))

Note: The Salida column looks like this in the data frame because it holds lists of integers created with Python.

Thanks for the help.

Your CSV file is not entirely friendly to use directly: the Salida column appears to hold a list of values. The first thing we can do is import it and study that column in more detail:

data <- read.csv("C:/Tmp/data.csv", stringsAsFactors=FALSE)

# Count the commas in each row; the number of values is that count plus one
n <- unique(lengths(regmatches(data$Salida, gregexpr(",",  data$Salida)))) + 1

[1] 8

We import the file as it is, then count the commas in each observation of Salida to check whether the column is heterogeneous in the number of values it holds. Every observation of Salida turns out to have 8 values, which paves the way for "expanding" the comma-separated strings into real columns.

# First remove both brackets (the backslashes must be doubled inside an R string)
data$Salida <- gsub('\\[|\\]', '', data$Salida)

# Split the 8 values of Salida into a matrix of (number of rows) x 8
m <- matrix(as.integer(unlist(strsplit(data$Salida, ","))), byrow = TRUE, ncol = n)
colnames(m) <- paste0("V", 1:n)

# Bind the matrix to the original data.frame
data <- cbind(data, m)
data <- data[, -4] # Drop the Salida column (the fourth one)


  X Ronda Contenido V1 V2 V3 V4 V5 V6 V7 V8
1 1     1         0  1  1  1  1  1  1  1  1
2 2     1         0  1  1  1  1  1  1  1  1
3 3     1         0  1  1  1  1  1  1  1  1
4 4     2         0  2  1  2  0  1  0  1  1
5 5     2         0  1  1  0  2  2  0  0  2
6 6     2         0  1  1  1  1  2  0  1  1

As you can see, we have transformed the Salida column into 8 columns named V1 to V8; using the first one, or any other, is now trivial.
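To connect this back to the question, here is a minimal base R sketch of the per-group mean and approximate 95% interval of V1. The data frame `datos` below is a hypothetical toy copy of the example in the question (only V1 is kept), and the interval uses the same "mean plus or minus 2 standard errors" rule as the question's code; none of these names come from the real file.

```r
# Toy data mirroring the question's example (hypothetical; only V1 is kept).
datos <- data.frame(
  Ronda     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  Contenido = c(0, 0, 0.1, 0.1, 0.2, 0.2, 0, 0, 0.1, 0.1, 0.2, 0.2),
  V1        = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4)
)

# Mean and standard error of V1 for each Ronda x Contenido combination.
resumen <- aggregate(
  V1 ~ Ronda + Contenido, data = datos,
  FUN = function(v) c(media = mean(v), error_est = sd(v) / sqrt(length(v)))
)

# aggregate() stores both statistics in one matrix column; flatten it.
resumen <- do.call(data.frame, resumen)
names(resumen) <- c("Ronda", "Contenido", "media", "error_est")

# Approximate 95% interval, same rule as in the question.
resumen$intervalo_inf <- resumen$media - 2 * resumen$error_est
resumen$intervalo_sup <- resumen$media + 2 * resumen$error_est

# Sort so the rows follow the order of the table in the question.
resumen <- resumen[order(resumen$Ronda, resumen$Contenido), ]
print(resumen)
```

The `media` column should match the table in the question. With only two observations per group the interval is very rough, so treat it as illustrative.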

All of the above uses base R, but if you have tidyverse it becomes much simpler and clearer:


library(tidyverse)

data %>%
    mutate(Salida = gsub('\\[|\\]', '', Salida)) %>% # Remove the brackets
    separate(Salida, into = paste0("V", 1:8))        # Split Salida into columns

NOTE: All this applies if you want to solve it in R. In this example, where the number of values is always fixed, you could instead use the find-and-replace of any other tool to remove the brackets [], which would leave you with a plain CSV file that imports without trouble.
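If only the first value is needed, or the number of values were not fixed, a regular expression can pull it out directly without expanding every column. This is a minimal sketch on a hypothetical vector `salida` of strings shaped like the Salida column; it assumes each string starts with `[` followed by an integer.

```r
# Hypothetical sample strings; the lengths differ on purpose.
salida <- c("[1,1,1,1]", "[2, 0, 1]", "[4,0,0,0,7]")

# Capture the first integer after the opening bracket and drop the rest.
primero <- as.integer(sub("^\\[[[:space:]]*(-?[0-9]+).*$", "\\1", salida))
print(primero)
```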

Although it is no longer relevant now that the data is available, I am keeping this answer because it addresses a different problem: extracting elements from a list column in a data.frame.

Is your data something like this?

foo <- data.frame(Ronda = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                  Contenido = c(0, 0, 0.1, 0.1, 0.2, 0.2, 0, 0, 0.1, 0.1, 0.2, 0.2))
# Salida is a list column: one numeric vector per row
foo$Salida <- list(c(1,1,1,1), c(1,1,1,1), c(1,1,1,1), c(1,1,1,1), c(1,1,1,1), c(1,1,1,1),
                   c(1,0,2,1), c(2,0,1,1), c(2,1,1,0), c(2,1,1,0), c(3,1,0,0), c(4,0,0,0))
str(foo)


'data.frame':   12 obs. of  3 variables:
 $ Ronda    : num  1 1 1 1 1 1 2 2 2 2 ...
 $ Contenido: num  0 0 0.1 0.1 0.2 0.2 0 0 0.1 0.1 ...
 $ Salida   :List of 12
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 1 1 1
  ..$ : num  1 0 2 1
  ..$ : num  2 0 1 1
  ..$ : num  2 1 1 0
  ..$ : num  2 1 1 0
  ..$ : num  3 1 0 0
  ..$ : num  4 0 0 0

It would really help if you followed @Patricio Moracho's recommendation and uploaded a sample of your data. If Salida is a list of lists rather than a list of vectors (as in foo), this solution might not work. If your data does have this structure, you can create a new column holding the first element of each entry of Salida with:

foo$salida1 <- sapply(foo$Salida, '[', 1)

sapply() applies a function to each element of a list, and foo$Salida is a list: it lives inside a data.frame, but it is a list nonetheless. sapply() also simplifies the output, here converting it into a vector. '[' is the subsetting function that extracts values by position, in this case the value at index 1.
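To make the role of '[' concrete, here is a small sketch on a hypothetical list `salida` (not the real data). It shows that passing '[' with the extra argument 1 is equivalent to writing x[1] by hand, and adds vapply() as a stricter variant that also checks that every result is a single number.

```r
# Hypothetical list of numeric vectors, shaped like foo$Salida.
salida <- list(c(1, 0, 2, 1), c(2, 0, 1, 1), c(3, 1, 0, 0))

# `[`(x, 1) is the same as x[1]; sapply() applies it to every element.
con_corchete <- sapply(salida, `[`, 1)

# The same extraction written with an explicit anonymous function.
con_funcion <- sapply(salida, function(x) x[1])

# vapply() does the same, but fails loudly if a result is not one number.
primeros <- vapply(salida, `[`, numeric(1), 1)

print(primeros)
```

vapply() is worth considering when the list might contain empty or oddly shaped entries, since sapply() would quietly return a list in that case.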

With these data you could do:

library(dplyr)
library(ggplot2)

foo %>%
  group_by(Ronda, Contenido) %>%
  summarise(media = mean(salida1),
            desvio = sd(salida1),
            error_est = desvio / sqrt(n()),
            intervalo_sup = media + (2*error_est),
            intervalo_inf = media - (2*error_est)) %>%
  ungroup() %>%
  mutate(Ronda = as.factor(Ronda), Contenido = as.factor(Contenido)) %>%
  ggplot(aes(x = Ronda, y = media, color = Contenido)) +
  geom_point(position = position_dodge(0.3)) +
  geom_errorbar(aes(ymax = intervalo_sup,
                    ymin = intervalo_inf),
                width = 0.3,
                position = position_dodge(0.3))  # Same dodge as the points so the bars line up with them

This produces:

