As usual, there are several ways to approach the problem. The general idea is to use several threads or processes, with each one responsible for downloading a part of the file.
To do this, the server has to support the Range header, which lets us request a specific range of bytes of the file instead of the complete file. Therefore, the first task is to check whether or not the server accepts ranges.
If it accepts them, we also need it to tell us the total number of bytes in the file so we can split it into x intervals properly and then reconstruct them. Once the byte ranges have been calculated (remember that the Range header includes both limits, unlike range in Python), we launch a process for each part and ask it to download it. If they all finish their work, all that is left is to join each fragment of bytes into a single file; of course, you have to do it in order if you do not want to end up with a nice file full of corrupt data.
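To illustrate the inclusive limits, here is a minimal sketch (the URL is just a placeholder for any server that honors ranges): requesting bytes=0-499 returns 500 bytes, not 499.

import urllib.request

# Placeholder URL: any server that honors Range works the same way.
req = urllib.request.Request('https://example.com/file.bin')
req.add_header('Range', 'bytes=0-499')  # both limits inclusive: bytes 0..499
with urllib.request.urlopen(req) as resp:
    print(resp.status)       # 206 (Partial Content) if the range was honored
    print(len(resp.read()))  # 500 bytes, unlike Python's range(0, 499)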
A simple example using multiprocessing in Python 3.x would be:
import urllib.request
from multiprocessing import Process, Manager


def descargar(url, orden, rango, frag):
    try:
        print('Getting fragment {}. Downloading from byte {} to byte {}.'.format(orden, *rango))
        req = urllib.request.Request(url)
        # Remember: the Range header includes both limits.
        req.add_header('Range', 'bytes={}-{}'.format(*rango))
        data = urllib.request.urlopen(req).read()
        if data:
            frag[orden] = data
            print('Fragment {} downloaded correctly. Got {} bytes.'.format(orden, len(data)))
        else:
            frag[orden] = None
    except Exception:
        frag[orden] = '#Error'
        raise


def descarga_paralela(url, fragmentos, nombre):
    with urllib.request.urlopen(url) as f:
        # Check that the server accepts partial downloads.
        if f.getheader('Accept-Ranges', 'none').lower() != 'bytes':
            print('Partial download not supported.')
            return
        print('Partial download supported.')
        # Get the total size of the file.
        size = int(f.getheader('Content-Length', '0'))
    print('File size: {} bytes.'.format(size))
    # Split that size into intervals according to the number of processes to launch.
    tamF = size // fragmentos
    print('Fragments: {}.\nApproximate size per fragment: {} bytes.'.format(fragmentos, tamF))
    ranges = [[i, i + tamF - 1] for i in range(0, size, tamF)]
    ranges[-1][-1] = size - 1  # the last valid byte index is size - 1
    # Use a dictionary shared between the processes; the key is the position
    # that each fragment of bytes occupies in the final file.
    manager = Manager()
    d = manager.dict()
    # Launch the processes.
    workers = [Process(target=descargar, args=(url, i, r, d)) for i, r in enumerate(ranges)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Rebuild the file, writing each fragment in its correct order. Note that
    # len(ranges) can be fragmentos + 1 when the size is not evenly divisible.
    with open(nombre, 'wb') as f:
        for i in range(len(ranges)):
            data = d.get(i)
            if data is None or data == '#Error':
                print('Fragment {} could not be downloaded. The file cannot be rebuilt.'.format(i))
                break
            f.write(data)
        else:
            print('File downloaded and rebuilt successfully.')


if __name__ == '__main__':
    url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/Bow_Lake_beim_Icefields_Parkway.jpg/1280px-Bow_Lake_beim_Icefields_Parkway.jpg'
    descarga_paralela(url, 10, 'imagen.jpg')
In this case we download a Wikipedia image using 10 parallel requests. If we run it, each process reports the fragment it is downloading, and of course the image ends up in the directory where the script lives... :)
The code is just an example using only the standard library, to show what the general idea would be. It can be improved in many ways: we could implement retries when a fragment fails (right now, if a process fails, we can say goodbye to our download XD), try to get the file name from the server if it is available, add nice progress bars, implement pause and resume; in short, complicate our lives as much as we want.
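As a sketch of the retry idea (not part of the example above; the function name and the intentos/espera parameters are made up for illustration), the worker could retry the same range a few times before giving up:

import time
import urllib.request

def descargar_con_reintentos(url, rango, intentos=3, espera=1.0):
    # Try to download the same byte range up to `intentos` times.
    for intento in range(1, intentos + 1):
        try:
            req = urllib.request.Request(url)
            req.add_header('Range', 'bytes={}-{}'.format(*rango))
            return urllib.request.urlopen(req).read()
        except Exception as exc:
            print('Attempt {} for range {} failed: {}'.format(intento, rango, exc))
            time.sleep(espera)  # simple fixed backoff between attempts
    return None  # the caller decides what to do with a failed fragment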