Use R to obtain data from a web page


I'm trying to get the data from the following web page.

For this I use the rvest library:

library(rvest)

url <- "https://stats.nba.com/teams/advanced/?sort=OFF_RATING&dir=-1&CF=MIN*GE*15&Season=2018-19&SeasonType=Regular%20Season"
tmp <- read_html(url)
tmp2 <- html_nodes(tmp, "table")

When running the last command, the resulting object tmp2 is an empty node set (length 0).
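
For reference, printing the result shows the empty node set (a quick check, not part of the original question):

length(tmp2)
#> [1] 0
tmp2
#> {xml_nodeset (0)}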

Can someone give me a hand?

Thanks !!

    
asked by Uko on 04.12.2018 at 19:39

1 answer


Problems

  • The table is not in the raw HTML: it is generated by JavaScript directly in the browser, so reading the URL with read_html() does not work. In other words, you need a browser/robot as an intermediary to download the page, render the embedded JavaScript code, and only then read the result with read_html() (see the quick check after this list).

  • The table is not a standard HTML table: the header (the column names) is separate from the body. One part is in the thead node (which is empty of content) and the rest in the tbody node. If you ask me, I suspect the owners of the page do not like scraping very much, although their terms of use do not explicitly prohibit it. In any case, I suggest you check them to make sure that what you want to do is not against the site's policy. Basically it depends on the use you will give the data: it can be downloaded, but not publicly reproduced.
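
  A quick way to see the first problem from R (a sketch based on the question's code, not part of the original answer):

    library(rvest)

    url <- "https://stats.nba.com/teams/advanced/?sort=OFF_RATING&dir=-1&CF=MIN*GE*15&Season=2018-19&SeasonType=Regular%20Season"
    raw <- read_html(url)

    # The static HTML served to read_html() contains no populated <table>,
    # because the rows are filled in by JavaScript in the browser.
    length(html_nodes(raw, "table"))
    #> [1] 0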

Solutions

  • In this question there are more details and a couple of approaches to this problem. @Patricio Moracho's answer is the one that works in this case; I could not get phantomjs to work. I'll paste code that worked for me to solve this step, without many explanations.

    library(RSelenium)  # It's back on CRAN!
    library(rvest)      # for read_html() further down
    rD <- rsDriver()    # Copy-pasted from Patricio's code in the linked question. It works, but I'm not sure exactly how or why.
    remDr <- rD$client
    remDr$open()
    url <- "https://stats.nba.com/teams/advanced/?sort=OFF_RATING&dir=-1&CF=MIN*GE*15&Season=2018-19&SeasonType=Regular%20Season"
    remDr$navigate(url)
    doc <- remDr$getPageSource()[[1]]
    page <- read_html(doc)
    
  • With this we get an object of class xml_node that contains the table, instead of a message saying that the table is being loaded.

  • The table is NOT standard, so it is not possible (at least I could not manage it) to use html_table() to easily get a well-formatted data.frame. The alternative I found is:

    • Identify and extract the node that contains the table ( nba-stat-table ).
    • Identify and extract the node with the body of the table ( tbody ).
    • Extract a list with all the table cell nodes ( td ).
    • Extract the text of those nodes into a vector. It does not have the right names and has no table structure, but the information is there.
    • Turn that vector into a matrix of 30 columns, one per team.
    • Turn that into a data.frame.
    • Paste the 18 variable names in by hand.
    • Reshape it to long format so there is not one column per team (tidy data).

        library(dplyr)
        library(tidyr)
        library(purrr)   # for set_names()

        page %>%
          html_node("nba-stat-table") %>%
          html_node("tbody") %>%
          html_nodes("td") %>%
          html_text() %>%
          matrix(ncol = 30) %>%               # one column per team
          data.frame(stringsAsFactors = F) -> datos

        as.character(datos[2, ]) -> nombres   # row 2 holds the team names

        datos %>%
          set_names(nombres) %>%
          slice(3:nrow(datos)) %>%            # keep only the 18 stat rows
          mutate(dato = c("GP", "W", "L", "MIN", "OffRtg", "DefRtg", "NetRtg", "AST%", "AST/TO", "AST_Ratio", "OREB%", "DREB%", "REB%", "TOV%", "eFG%", "TS%", "PACE", "PIE")) %>% # names by hand... very ugly
          gather(equipo, valor, -dato) -> datos_limpios

        head(datos_limpios)
    
            dato          equipo  valor
        1     GP Milwaukee Bucks     22
        2      W Milwaukee Bucks     15
        3      L Milwaukee Bucks      7
        4    MIN Milwaukee Bucks 1066.0
        5 OffRtg Milwaukee Bucks  115.5
        6 DefRtg Milwaukee Bucks  106.0
    

    With this we get long data that, with spread(datos_limpios, dato, valor), becomes tidy data: each row is a team and each column an attribute of that team (see the sketch after this list). As you will see it works, but it has a lot of hard-coded, ad hoc pieces:

    • The number of columns (30, one per team).
    • The number and names of the variables. This is the most complicated part to solve, and it happens because the site splits the table into two parts. For visual rendering in the browser this causes no problem, but it does for scraping.
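
    For reference, a minimal sketch of that last reshaping step (it assumes the datos_limpios built above; datos_anchos is just an illustrative name):

        library(tidyr)
        # spread() turns the long data into one row per team and one column per stat
        datos_anchos <- spread(datos_limpios, dato, valor)
        dim(datos_anchos)
        #> [1] 30 19    # 30 teams; equipo plus the 18 stats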

    I hope it helps you.

      

    Added: code to extract column names from thead

    page %>%
      html_nodes(xpath = '//th[not(@hidden="") and not(@data-field="TEAM_NAME") and @sort=""]') %>% # There are hidden columns and other unnecessary ones; this xpath matches only the informative ones.
      html_text() -> nombres_columna
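
    With that vector you could replace the hand-typed names in the pipeline above (a sketch; it assumes nombres_columna comes out in the same order as the stat rows of datos):

    datos %>%
      set_names(nombres) %>%
      slice(3:nrow(datos)) %>%
      mutate(dato = nombres_columna) %>%   # instead of typing the 18 names by hand
      gather(equipo, valor, -dato) -> datos_limpios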
    
        
    answered on 04.12.2018 at 23:55