How do I format rows (observations), columns (variables) to this file?


I'm thinking how to give the typical dataset format to the 1 next document excell, the only option that has occurred to me is transpose manually in excell, some idea in Excell or R? The idea is to have in each row a neighborhood or district of Madrid in the columns the different variables. (it is the dataset of the link in excell format "Study of indicators of districts and neighborhoods 2017")

asked by agustin marin 27.11.2018 в 22:46

1 answer


I assume you want each row to be a territorial unit (district, neighborhood, etc.) and each column an attribute or variable measured in that territorial unit. This would be (approximately) tidydata, in Wickham's sense. If you are looking for another structure for the data you should edit your question to make it explicit.

There is a first approach to the problem, which at least advances the work somewhat by producing a transposed data.frame and with reasonably appropriate names. Definitely requires more cleaning, better names, etc. Clearing Excel files is very complicated because each one has the structure given by who did it and an ad hoc solution is always necessary, so you have a few hours ahead;)


  • I had to transform the file out of R to xlsx, I could not read it as .xls
  • The script works with the first excel sheet (CENTER). However if works well and -mainly- if all the sheets have the same structure, you could reproduce it on each sheet and then merge each part into a complete data.frame.
  • Code

    The most important function is transpose of data.table . The rest is handling row and column names.

    datos <- read_excel("excel_maldito/IndicadoresDistritos2017_2.xlsx", sheet = "CENTRO", skip = 1) #Importante skip = 1 para mantener nombres de columna.
    datos <- rbind(names(datos), datos) #Paso los nombres de columna como primera fila, si no transpose los pierde.
    datos_t <- data.table::transpose(datos) #Transpongo los datos: ahora las columnas son filas.
    colnames(datos_t) <- datos_t[1, ]      #Paso primera fila a nombres de columna, de este modo los conservé
    datos_t <- datos_t[2:nrow(datos_t),]   #Y elimino esa primera fila, ahora es redundante
    datos_t_limpio <- datos_t[ ,!sapply(lapply(datos_t,, all)]  #Elimino todas las columnas que solo contienen NA.
    names(datos_t_limpio)[307] <- "Capítulo Presupuestario"  #En en los datos produce un NA
    datos_t_limpio <- datos_t_limpio[1:16, ] #ATENCION: podría fallar en otras horas con más o menos unidades territoriales


    • Excel has absolute values and proportions. A decision should be made about what to do with it. If the proportions can be calculated from the absolute values they could be eliminated.

    • I'm not sure that the budget data has the same structure as the rest. It should be verified and deleted / reformatted depending on the case.

    • I'm not sure that they paint the rows X__10 to X__12 . I think they are repeated or have irrelevant information, that's why I eliminate them, but I would have to verify it carefully and see what happens in the remaining sheets that may have more or less neighborhoods .

    Then you tell me if you approach the solution. Clearing excels is painful, luck and patience.

    answered by 28.11.2018 в 18:31