As usual, there are several ways to approach the problem. The general idea is to use several threads or processes, with each one responsible for downloading a part of the file.
To do this, the server has to support the Range header, which lets us request a specific range of bytes of the file instead of the complete file. Therefore, the first task is to check whether or not the server accepts ranges.
If it accepts them, we also need it to tell us the total number of bytes in the file so we can split it into x intervals properly and then reconstruct them. Once the byte ranges have been calculated (remember that the Range header includes both limits, unlike range in Python), we launch a process for each part and ask it to download it. If they all finish their work, all that is left is to join each fragment of bytes into a single file; of course, you have to do it in order if you do not want to end up with a nice file full of corrupt data.
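To illustrate the inclusive limits, here is a minimal sketch (the URL is just a placeholder for any server that honors ranges): requesting bytes=0-499 returns 500 bytes, not 499.

import urllib.request

# Placeholder URL: any server that honors Range works the same way.
req = urllib.request.Request('https://example.com/file.bin')
req.add_header('Range', 'bytes=0-499')  # both limits inclusive: bytes 0..499
with urllib.request.urlopen(req) as resp:
    print(resp.status)       # 206 (Partial Content) if the range was honored
    print(len(resp.read()))  # 500 bytes, unlike Python's range(0, 499)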
A simple example using multiprocessing in Python 3.x would be:
import urllib.request
from multiprocessing import Process, Manager


def descargar(url, orden, rango, frag):
    try:
        print('Getting fragment {}. Downloading from byte {} to byte {}.'.format(orden, *rango))
        req = urllib.request.Request(url)
        # Remember: the Range header includes both limits.
        req.add_header('Range', 'bytes={}-{}'.format(*rango))
        data = urllib.request.urlopen(req).read()
        if data:
            frag[orden] = data
            print('Fragment {} downloaded correctly. Got {} bytes.'.format(orden, len(data)))
        else:
            frag[orden] = None
    except Exception:
        frag[orden] = '#Error'
        raise


def descarga_paralela(url, fragmentos, nombre):
    with urllib.request.urlopen(url) as f:
        # Check that the server accepts partial downloads.
        if f.getheader('Accept-Ranges', 'none').lower() != 'bytes':
            print('Partial download not supported.')
            return
        print('Partial download supported.')
        # Get the total size of the file.
        size = int(f.getheader('Content-Length', '0'))
    print('File size: {} bytes.'.format(size))
    # Split that size into intervals according to the number of processes to launch.
    tamF = size // fragmentos
    print('Fragments: {}.\nApproximate size per fragment: {} bytes.'.format(fragmentos, tamF))
    ranges = [[i, i + tamF - 1] for i in range(0, size, tamF)]
    ranges[-1][-1] = size - 1  # the last valid byte index is size - 1
    # Use a dictionary shared between the processes; the key is the position
    # that each fragment of bytes occupies in the final file.
    manager = Manager()
    d = manager.dict()
    # Launch the processes.
    workers = [Process(target=descargar, args=(url, i, r, d)) for i, r in enumerate(ranges)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Rebuild the file, writing each fragment in its correct order. Note that
    # len(ranges) can be fragmentos + 1 when the size is not evenly divisible.
    with open(nombre, 'wb') as f:
        for i in range(len(ranges)):
            data = d.get(i)
            if data is None or data == '#Error':
                print('Fragment {} could not be downloaded. The file cannot be rebuilt.'.format(i))
                break
            f.write(data)
        else:
            print('File downloaded and rebuilt successfully.')


if __name__ == '__main__':
    url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/Bow_Lake_beim_Icefields_Parkway.jpg/1280px-Bow_Lake_beim_Icefields_Parkway.jpg'
    descarga_paralela(url, 10, 'imagen.jpg')
In this case we download a Wikipedia image using 10 parallel requests. If we run it, each process reports the fragment it is downloading, and of course the image ends up in the directory where the script lives... :)
The code is just an example using only the standard library, to show what the general idea would be. It can be improved in many ways: we could implement retries when a fragment fails (right now, if a process fails, we can say goodbye to our download XD), try to get the file name from the server if it is available, add nice progress bars, implement pause and resume; in short, complicate our lives as much as we want.
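As a sketch of the retry idea (not part of the example above; the function name and the intentos/espera parameters are made up for illustration), the worker could retry the same range a few times before giving up:

import time
import urllib.request

def descargar_con_reintentos(url, rango, intentos=3, espera=1.0):
    # Try to download the same byte range up to `intentos` times.
    for intento in range(1, intentos + 1):
        try:
            req = urllib.request.Request(url)
            req.add_header('Range', 'bytes={}-{}'.format(*rango))
            return urllib.request.urlopen(req).read()
        except Exception as exc:
            print('Attempt {} for range {} failed: {}'.format(intento, rango, exc))
            time.sleep(espera)  # simple fixed backoff between attempts
    return None  # the caller decides what to do with a failed fragment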