Read table (HTML) with rvest


I am trying to read the table with the history of the monthly minimum wage in Colombia. The Banco de la República publishes the series at the following link: Minimum legal salary in Colombia, but I cannot get the table to read correctly. Any solution is appreciated.

library(rvest)
url <- "http://obieebr.banrep.gov.co/analytics/saw.dll?Go&Action=prompt&lang=es&NQUser=publico&NQPassword=publico&path=%2Fshared%2fSeries%20Estad%C3%ADsticas_T%2F1.%20Salarios%2F1.1%20Salario%20m%C3%ADnimo%20legal%20en%20Colombia%2F1.1.1.SLR_Serie%20hist%C3%B3rica&Options=rdf"
archivo <- read_html(url)
tablas <- html_nodes(archivo, "table")

tabla1 <- html_table(tablas[1], fill = TRUE, header = T, dec = ",")
datos <- as.data.frame(tabla1)
head(datos)

My result

[1] Var.1  Banco.de.la.República
<0 rows> (or 0-length row.names)

Expected result

1   1984    376.6  11298               0                         N/A 
2   1985   451.92  13558              20       0001 de enero de 1985 
3   1986   560.38  16811              24   3754 de diciembre de 1985 
4   1987   683.66  20510              22   3732 de diciembre de 1986 
...  ...      ...    ...             ...                         <NA>
32  2015 21478.33 644350             4.6 2731 de diciembre 30 de 2014
33  2016 22981.83 689455               7 2552 de diciembre 30 de 2015
34  2017 24590.56 737717               7 2209 de diciembre 30 de 2016
35  2018  26041.4 781242             5.9 2269 de diciembre 30 de 2017
    
asked by Rafael Díaz 23.07.2018 at 02:13

2 answers


The diagnosis by @mpaladino seems to be correct: the page uses JavaScript to "draw" the data table, so running read_html() only returns the static code, without the dynamically loaded data. If you cannot use the download links the page offers, you need to attack the problem another way. One possibility is RSelenium, a scraping engine that interacts with the page the way a user would, opening a browser instance so that we can capture the final HTML after the JavaScript has executed.

require(RSelenium)
library(rvest)

# A dedicated Chrome browser is downloaded automatically
# This can take quite a while
rD <- rsDriver()
remDr <- rD$client
remDr$open()
url <- "http://obieebr.banrep.gov.co/analytics/saw.dll?Go&Action=prompt&lang=es&NQUser=publico&NQPassword=publico&path=%2Fshared%2fSeries%20Estad%C3%ADsticas_T%2F1.%20Salarios%2F1.1%20Salario%20m%C3%ADnimo%20legal%20en%20Colombia%2F1.1.1.SLR_Serie%20hist%C3%B3rica&Options=rdf"
remDr$navigate(url)

# Only once the page has fully loaded does execution
# resume, and then we can read the HTML
doc <- remDr$getPageSource()[[1]]
page <- read_html(doc)
node <- html_nodes(page, xpath = '//td[@class="PTChildPivotTable"]') 
table <- html_node(node, "table")

# Fix several issues with the captured table
datos <- as.data.frame(html_table(table, fill = TRUE, header = T, dec = ","))
colnames(datos) <- datos[1:5,1]
datos <- datos[-c(1:5),1:5]

# A sample of the data
head(datos)

    Año Salario mínimo diario Salario mínimo mensual Variación porcentual anual ¹ Decretos del Gobierno Nacional
6  1984              $ 376,60            $ 11.298,00                        0,00%                            N/A
7  1985              $ 451,92            $ 13.558,00                       20,00%          0001 de enero de 1985
8  1986              $ 560,38            $ 16.811,00                       24,00%      3754 de diciembre de 1985
9  1987              $ 683,66            $ 20.510,00                       22,00%      3732 de diciembre de 1986
10 1988              $ 854,58            $ 25.637,00                       25,00%      2545 de diciembre de 1987
11 1989            $ 1.085,32            $ 32.560,00                       27,00%      2662 de diciembre de 1988
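Incidentally, the diagnosis can be reproduced offline: rvest only parses the static HTML it receives. The snippet below uses an invented, simplified stand-in for the shell the server sends (a table with a header cell and no data rows) and gets the same zero-row result as in the question.

```r
library(rvest)

# Invented stand-in for the static shell the server returns: a table
# whose data rows are meant to be filled in later by JavaScript.
shell  <- '<html><body><table><tr><td>Banco de la República</td></tr></table></body></html>'
pagina <- read_html(shell)
tabla  <- html_table(html_nodes(pagina, "table")[[1]], header = TRUE)
nrow(tabla)  # 0: the data rows simply are not in the HTML that rvest sees
```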
    
answered by 23.07.2018 / 20:15

Web scraping of tables generated with JavaScript

I'll add an alternative to RSelenium, which left CRAN due to problems with some of its dependencies and is now harder to install. In this alternative, instead of Selenium we use phantomjs, an "invisible" (headless) browser that executes the page's JavaScript and saves the result as a static HTML file that we can later read and manipulate from R.

phantomjs

You need to have phantomjs installed. Installation is very easy: it only consists of copying the binary to the location from which you want to run it. On Linux we copy it to /usr/local/bin/ so that it is directly on the PATH. You also need to give it execute permissions.

The JavaScript script

phantomjs only "eats" JavaScript, so we have to pass it a script in that language with the instructions. I adapted the one that follows from the example at link

// scrapping_banco_colombia.js
// adaptado de https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/


// on the next line we set the web address we want to read
var url ='http://obieebr.banrep.gov.co/analytics/saw.dll?Go&Action=prompt&lang=es&NQUser=publico&NQPassword=publico&path=%2Fshared%2fSeries%20Estad%C3%ADsticas_T%2F1.%20Salarios%2F1.1%20Salario%20m%C3%ADnimo%20legal%20en%20Colombia%2F1.1.1.SLR_Serie%20hist%C3%B3rica&Options=rdf';
var page = new WebPage();
var fs = require('fs');


page.open(url, function (status) {
        just_wait();
});

function just_wait() {
    // Some pages show a progress message or other information before the
    // data appears; this timeout makes phantomjs wait for the final result.
    // It is in milliseconds: here we wait 10 seconds before saving the file.
    setTimeout(function() {
        // the output file name can be changed on the next line
        fs.write('datos.html', page.content, 'w');
        phantom.exit();
    }, 10000);
}

Save the script to a file with the .js extension and remember its location. In this case I called it scrapping_banco_colombia.js.

Generate the static html.

This is the easiest part. From the operating system console we run the instruction:

phantomjs scrapping_banco_colombia.js

If we do not want to go to the console we can do it from R with the system() function. In fact, you could write a script that dynamically creates the .js file with a different web address and output file name. That corresponds to a more frequent scraping scenario: harvesting lots of data with the same structure.

system("phantomjs scrapping_banco_colombia.js")

And 10 seconds later we obtain datos.html, or whatever name we give the file by modifying the .js script.
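The dynamic-generation idea mentioned above can be sketched with base R. The helper below is hypothetical (the function and argument names are my own): it fills a template with the URL, the output file and the wait time, writes the .js file, and leaves it ready for system().

```r
# Hypothetical helper: builds a phantomjs script for any URL / output file.
# Only base R is used; run the result with system("phantomjs <file>").
escribir_script <- function(url, salida, archivo_js = "scrapping.js", espera_ms = 10000) {
  plantilla <- "var url = '%s';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
    setTimeout(function() {
        fs.write('%s', page.content, 'w');
        phantom.exit();
    }, %d);
});"
  writeLines(sprintf(plantilla, url, salida, as.integer(espera_ms)), archivo_js)
  archivo_js
}

js <- escribir_script("http://example.com", "datos.html")
# system(paste("phantomjs", js))   # run it once phantomjs is installed
```

Looping escribir_script() over a vector of URLs would generate one script per page, which matches the "lots of data with the same structure" scenario.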

  

If it does not work, it is likely that phantomjs is not registered in the operating system's PATH. It is good practice to add it.
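From R, one quick way to check is Sys.which(), a base function that returns an empty string when a binary cannot be found on the PATH:

```r
# TRUE when the operating system can locate the phantomjs binary
disponible <- nzchar(Sys.which("phantomjs"))
if (!disponible) {
  message("phantomjs is not on the PATH; copy it to /usr/local/bin/ or similar")
}
disponible
```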

Scraping:

The following is more or less what Patrick's answer does, written in another way:

library(rvest)
library(tidyverse)
library(stringr)

read_html("datos.html") %>%
  html_nodes( xpath = '//td[@class="PTChildPivotTable"]') %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE, header = T, dec = ",") %>% 
  .[[1]] %>%
  .[1:5] %>% 
  drop_na %>% 
  setNames(c("Año", 
             "Salario mínimo diario", 
             "Salario mínimo mensual", 
             "Variación porcentual anual", 
             "Decretos del Gobierno Nacional")) %>%  
  # Strip symbols ($, % and the thousands separator ".") and turn the
  # decimal comma into a dot so the values can be parsed as numbers.
  mutate_all(funs(str_remove_all(., "\\$|\\.|%"))) %>%
  mutate_all(funs(str_replace_all(., ",", "."))) %>%
  mutate_all(str_trim) %>%                 # remove surrounding whitespace
  mutate_at(vars(-5), as.numeric)          # every column except the last

I added some character-cleaning lines so that the corresponding columns can be converted to numeric.
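The cleaning step can be checked in isolation. This base-R equivalent (gsub instead of stringr; the function name is mine) shows why the decimal comma has to become a dot before as.numeric() can parse the values:

```r
# Base-R equivalent of the cleaning above: strip "$", "%" and the
# thousands separator ".", then turn the decimal comma into a dot.
limpiar <- function(x) {
  x <- gsub("\\$|\\.|%", "", x)
  x <- gsub(",", ".", x, fixed = TRUE)
  as.numeric(trimws(x))
}

limpiar(c("$ 376,60", "$ 11.298,00", "20,00%"))
# → 376.6 11298.0 20.0
```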

    
answered by 24.07.2018 at 22:41