Problems
The table is not in the raw HTML: the table's HTML is generated by JavaScript directly in the browser, so the approach of reading the URL with read_html() does not work. In other words, you have to use a browser / robot as an intermediary to download the page, render and run the embedded JavaScript code, and then read the result with read_html().
The table is not a standard HTML table: the header (the column names) is separate from the body. One lives in the thead node (which has no text content) and the rest in the tbody node. If you ask me, I think the owners of the page are not very fond of scraping, although their terms of use do not explicitly prohibit it. In any case, I suggest you check to make sure that what you want to do is not against the site's policy. Basically it depends on the use you will give the data: it can be downloaded, but not publicly reproduced.
Solutions
In this question there are more details and a couple of approaches to the problem. The answer by @Patricio Moracho is the one that works in this case; with phantomjs I could not get it to work. I will paste the code that worked for me to solve this step, without much explanation.
library(RSelenium) # It's back on CRAN!
library(rvest)     # for read_html()

rD <- rsDriver() # Copy-pasted from Patricio's code in the linked question. It works, but I'm not sure exactly how or why.
remDr <- rD$client
remDr$open()
url <- "https://stats.nba.com/teams/advanced/?sort=OFF_RATING&dir=-1&CF=MIN*GE*15&Season=2018-19&SeasonType=Regular%20Season"
remDr$navigate(url)
doc <- remDr$getPageSource()[[1]]
page <- read_html(doc)
With this we get an object of class xml_document that actually contains the table, instead of a message saying the table is loading.
The table is NOT standard, so it is not possible (at least I could not manage it) to use html_table to easily get a well-formatted data.frame. The alternative I found is:
- Identify and extract the node that wraps the table (nba-stat-table).
- Identify and extract the node with the body of the table (tbody).
- Extract a list with all the table cell nodes (td).
- Extract the text of those nodes into a vector. The names are wrong and there is no table structure, but the information is there.
- Turn that vector into a matrix of 30 columns, one per team.
- Convert it to a data.frame.
- Paste the 18 variable names in manually.
- Reshape it to long format so there is not one column per team (tidy data).
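A quick, self-contained illustration (toy values, not the real scrape) of why the matrix step puts one team per column: matrix() fills column-wise by default, so each consecutive run of cells becomes one column.

```r
# Toy vector: 2 "teams" x 3 cells each, in the order the scrape returns them
celdas <- c("Bucks", "22", "15", "Raptors", "21", "14")

# One column per team, because matrix() fills down each column first
m <- matrix(celdas, ncol = 2)
m
#      [,1]    [,2]
# [1,] "Bucks" "Raptors"
# [2,] "22"    "21"
# [3,] "15"    "14"
```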
library(dplyr) # %>%, slice(), mutate()
library(tidyr) # gather(), spread()
library(purrr) # set_names()

page %>%
  html_node("nba-stat-table") %>%
  html_node("tbody") %>%
  html_nodes("td") %>%
  html_text() %>%
  matrix(ncol = 30) %>%
  data.frame(stringsAsFactors = FALSE) -> datos
as.character(datos[2,]) -> nombres
datos %>%
  set_names(nombres) %>%
  slice(3:n()) %>% # drop the first two rows (non-stat cells, including the team name)
  mutate(dato = c("GP", "W", "L", "MIN", "OffRtg", "DefRtg", "NetRtg", "AST%", "AST/TO", "AST_Ratio", "OREB%", "DREB%", "REB%", "TOV%", "eFG%", "TS%", "PACE", "PIE")) %>% # Names by hand... very ugly
  gather(equipo, valor, -dato) -> datos_limpios
head(datos_limpios)
dato equipo valor
1 GP Milwaukee Bucks 22
2 W Milwaukee Bucks 15
3 L Milwaukee Bucks 7
4 MIN Milwaukee Bucks 1066.0
5 OffRtg Milwaukee Bucks 115.5
6 DefRtg Milwaukee Bucks 106.0
With this we get long data; with spread(datos_limpios, dato, valor) we get to tidy data: each row is a team and each column an attribute of that team.
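For example, on a small made-up slice of the long data (same column names as above, fake values), spread() pivots each dato into its own column:

```r
library(tidyr)

# Fake two-team, two-stat slice of datos_limpios
datos_limpios <- data.frame(
  dato   = c("GP", "W", "GP", "W"),
  equipo = c("Milwaukee Bucks", "Milwaukee Bucks",
             "Toronto Raptors", "Toronto Raptors"),
  valor  = c("22", "15", "21", "14"),
  stringsAsFactors = FALSE
)

ancho <- spread(datos_limpios, dato, valor)
ancho
#            equipo GP  W
# 1 Milwaukee Bucks 22 15
# 2 Toronto Raptors 21 14
```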
As you can see it works, but it has a lot of hard-coded, ad hoc values:
- The number of columns (30 teams).
- The number and names of the variables. This is the trickiest part to solve, and it happens because the site splits the table into two parts. Visually, in the browser, that causes no problem; for scraping it does.
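One way to get rid of the hard-coded 30 (a self-contained sketch with made-up markup, not the real page) is to count the tr rows in the tbody and use that as ncol:

```r
library(rvest)

# Made-up table body with three rows, standing in for the real tbody
html <- '<table><tbody>
  <tr><td>Bucks</td><td>22</td></tr>
  <tr><td>Raptors</td><td>21</td></tr>
  <tr><td>Nuggets</td><td>20</td></tr>
</tbody></table>'

cuerpo <- read_html(html)
n_equipos <- length(html_nodes(cuerpo, "tbody tr")) # 3 here; 30 on the real page

celdas <- html_text(html_nodes(cuerpo, "tbody td"))
m <- matrix(celdas, ncol = n_equipos)               # no magic number
```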
I hope it helps you.
Added: code to extract column names from thead
page %>%
  html_nodes(xpath = '//th[not(@hidden="") and not(@data-field="TEAM_NAME") and @sort=""]') %>% # There are hidden and otherwise unnecessary columns; this xpath matches only the informative ones.
  html_text() -> nombres_columna
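Putting it together (a sketch with stand-in values; on the real page nombres_columna comes from the xpath above and datos from the earlier scrape), the extracted names can replace the hand-typed vector in the mutate() step:

```r
library(dplyr)

# Stand-ins: a 3-stat, 1-team slice instead of the real 18 x 30 scrape
nombres_columna <- c("GP", "W", "L")
datos <- data.frame(`Milwaukee Bucks` = c("22", "15", "7"),
                    check.names = FALSE, stringsAsFactors = FALSE)

datos %>%
  mutate(dato = nombres_columna) -> con_nombres # no more typing 18 names by hand
```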