Extract data from a web page


I am investigating how to extract data from this web page (updated price information). The page generates an .xls file. My idea is to automate this download, for example with some Python library. I have seen other tools, but I would like to do it in code.

Greetings

    
asked by Jav on 03.08.2018 at 11:31

1 answer


The general method can be very complex, depending on the technology the web page is built with, the data you want to obtain, whether you need to log in first, and whether the authors have decided to make scraping difficult.

In this particular case it is not very complicated, although it could have been even simpler.

You must start by finding out the URL from which the Excel file is actually downloaded. For this, the developer tools that most web browsers include today are an indispensable aid. Using these tools we can inspect the code of the "Download" button and see that it is part of an HTML form, but the form does not expose the download URL, because it uses JavaScript to contact the server, invoking a function called downloadFile().

We could keep digging through the page's code to see what that function does, and in more complex cases it would be necessary to do so, but in this case we can take another route.

Among the developer tools there is a tab called "Network", which lets you watch the HTTP requests taking place. With that tab in view, click on the download button and we see that a GET request is made to the URL http://geoportalgasolineras.es/downloadReportPrecios?tipoEstacion=EESS&productoId=1 .
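As a side note, the URL the Network tab revealed is just a query string over a fixed endpoint. If other station types or products map to other tipoEstacion / productoId values (an assumption; only the pair EESS / 1 was observed on the page), the URL can be rebuilt programmatically, for example with urllib.parse.urlencode:

```python
from urllib.parse import urlencode

BASE = "http://geoportalgasolineras.es/downloadReportPrecios"

def build_url(tipo_estacion="EESS", producto_id=1):
    # Reassemble the query string exactly as it appeared in the Network tab
    return BASE + "?" + urlencode({"tipoEstacion": tipo_estacion,
                                   "productoId": producto_id})

print(build_url())
# http://geoportalgasolineras.es/downloadReportPrecios?tipoEstacion=EESS&productoId=1
```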

This is what we needed. You do not even need to program anything in Python: you can use a command-line tool like wget or curl to download that file directly. For example:

curl "http://geoportalgasolineras.es/downloadReportPrecios?tipoEstacion=EESS&productoId=1" > precios.xls

If you insist on doing it in Python, the requests library makes it simple:

import requests

def download_file(url, nombre_local):
    r = requests.get(url)
    if r.status_code != 200:
        print("Error {}: {}".format(r.status_code, r.reason))
        return
    with open(nombre_local, 'wb') as f:
        f.write(r.content)


download_file('http://geoportalgasolineras.es/downloadReportPrecios?tipoEstacion=EESS&productoId=1',
              'precios.xls')

The file is large (2.2 MB) and takes a while to download. During that time the program appears to be doing nothing. So I thought about transferring it in streaming mode, in which instead of receiving the complete file at once we receive it in chunks (and we can print something between chunks to see the speed at which it is being downloaded). The code for this second implementation would look like this:

import requests

def download_file(url, nombre_local):
    r = requests.get(url, stream=True)
    with open(nombre_local, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # skip empty keep-alive chunks
                print(".", end="", flush=True)
                f.write(chunk)
download_file('http://geoportalgasolineras.es/downloadReportPrecios?tipoEstacion=EESS&productoId=1',
              'precios.xls')

Unfortunately, for this to work properly the server should deliver the information in pieces, and it seems that is not the case here: if you run the code above you will see that for a long time apparently nothing happens, and then the dots start printing at full speed (I understand that requests has downloaded the complete file and we are then merely iterating over it chunk by chunk in the loop).

Final clarification: this method is not guaranteed to keep working tomorrow, if the authors of the page decide to change the URL from which the data is downloaded, or start requiring some kind of cookie from the client. In general, if the page to be scraped gets more complicated, it can be much better to drive a real browser from a script, so that it simulates the user's actions ("press this button", etc.). This is usually harder to set up. Look at projects such as selenium or requests-html.

    
answered on 03.08.2018 at 12:55