Web scraping of tables generated with JavaScript
I'll add an alternative to RSelenium, a package that left CRAN due to problems with some of its dependencies and is now more difficult to install. In this alternative, instead of Selenium we use phantomjs, a "headless browser" that executes the JavaScript code of the page and saves the result as static HTML that we can later read and manipulate from R.
phantomjs
It is necessary to have phantomjs installed. The installation is very easy: it just consists of copying the binary to the location from which we want to run it. On Linux we copy it to /usr/local/bin/ so that it is directly on the PATH. It is also necessary to give the binary execute permissions.
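If you prefer not to leave R, the webshot package includes a helper that downloads and installs the phantomjs binary for you; this is an alternative route to the manual copy described above, and it may place the binary in a user directory rather than on the shell PATH:

# install.packages("webshot")  # if the package is not already installed
# Downloads the phantomjs binary and installs it where R can find it.
# Check where it lands if you plan to call phantomjs from the console.
webshot::install_phantomjs()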
The JavaScript script
phantomjs only "eats" JavaScript, so we have to pass it a script in that language with the instructions. I adapted the one that follows from the example at https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/:
// scrapping_banco_colombia.js
// adapted from https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/
// in the following line we load the web address we want to read
var url = 'http://obieebr.banrep.gov.co/analytics/saw.dll?Go&Action=prompt&lang=es&NQUser=publico&NQPassword=publico&path=%2Fshared%2fSeries%20Estad%C3%ADsticas_T%2F1.%20Salarios%2F1.1%20Salario%20m%C3%ADnimo%20legal%20en%20Colombia%2F1.1.1.SLR_Serie%20hist%C3%B3rica&Options=rdf';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
  just_wait();
});

// some pages show a progress message or other information before displaying
// the data; this timeout makes phantomjs wait until the final result is there.
// It is in milliseconds: in this case it waits 10 seconds before saving the file.
function just_wait() {
  setTimeout(function() {
    // in the following line you can specify the output file name.
    fs.write('datos.html', page.content, 'w');
    phantom.exit();
  }, 10000);
}
Save the script to a file with the .js extension and remember its location. In this case I called it scrapping_banco_colombia.js.
Generate the static HTML
This is the easiest part. From the operating system's console we run:
phantomjs scrapping_banco_colombia.js
If we do not want to go to the console, we can do it from R with the system function. In fact, you could create a script that dynamically generates the .js file with a different web address and output file name; see the sketch below. This would cover a more frequent scraping scenario, where we want to pull in a lot of data with the same structure.
system("phantomjs script_banco_colombia.js")
And 10 seconds later we obtain datos.html, or whatever name we give the file by modifying the .js script.
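A minimal sketch of the dynamic version mentioned above; the function name scrape_js_table and the embedded template are my own, not part of any package:

# Writes a phantomjs script for a given URL and output file, then runs it.
# paste0 is used instead of sprintf because URLs often contain '%' sequences
# that sprintf would try to interpret as format specifiers.
scrape_js_table <- function(url, output_file, js_file = "scraper_tmp.js", wait_ms = 10000) {
  template <- paste0(
    "var url = '", url, "';\n",
    "var page = new WebPage();\n",
    "var fs = require('fs');\n",
    "page.open(url, function (status) {\n",
    "  setTimeout(function() {\n",
    "    fs.write('", output_file, "', page.content, 'w');\n",
    "    phantom.exit();\n",
    "  }, ", wait_ms, ");\n",
    "});"
  )
  writeLines(template, js_file)        # create the .js script on the fly
  system(paste("phantomjs", js_file))  # run it; the result ends up in output_file
}

# Hypothetical usage with a different output name:
# scrape_js_table(mi_url, "datos_2.html")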
If it does not work, it is likely that phantomjs is not on the operating system's PATH. It is good practice to add it.
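A quick way to check from R (Sys.which is base R):

# Returns the full path of the phantomjs binary, or "" if it is not on the PATH.
Sys.which("phantomjs")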
Scraping:
The following is more or less the same as Patrick's answer, written in another way:
library(rvest)
library(tidyverse)
library(stringr)

read_html("datos.html") %>%
  html_nodes(xpath = '//td[@class="PTChildPivotTable"]') %>%
  html_nodes("table") %>%
  html_table(fill = TRUE, header = TRUE, dec = ",") %>%
  .[[1]] %>%
  .[1:5] %>%
  drop_na() %>%
  setNames(c("Año",
             "Salario mínimo diario",
             "Salario mínimo mensual",
             "Variación porcentual anual",
             "Decretos del Gobierno Nacional")) %>%
  # Strip symbols so the columns can be converted to numeric.
  mutate_all(~ str_remove_all(., "\\$|\\.|,|%")) %>% # a regex that removes $ . , %
  mutate_all(str_trim) %>%                           # remove surrounding whitespace
  mutate_at(vars(-5), as.numeric)                    # all columns except the last
I added some character-cleaning lines so that the corresponding columns could be converted to numeric.
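If you want to keep the result, assign the pipeline above to an object and save it; a small sketch, where the object and file names are mine:

# `salarios` stands for the result of the pipeline above.
sapply(salarios, class)  # the first four columns should now be numeric
write_csv(salarios, "salario_minimo_colombia.csv")  # persist the clean table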