Load different URLs without closing the browser

4

With this code I load different URLs in the browser to extract the page source, but I close the browser after each read. Can I load the next URL in the list without closing the browser?

from selenium import webdriver
from bs4 import BeautifulSoup

delFichero = file('listado.txt', 'r')

for n in delFichero:

    URL = str(n)

    browser = webdriver.Firefox()
    browser.get(URL)
    content = browser.page_source

    soup = BeautifulSoup(content)

    browser.quit()
    
asked by chikilicuatre 03.08.2017 at 14:53

2 answers

1

You just have to modify your code a bit. Create the browser before the loop and close it after the loop. Something like this:

from selenium import webdriver


urls = [
    'https://www.google.com',
    'https://duckduckgo.com'
]
browser = webdriver.Firefox()
fuentes = {}

for url in urls:
    browser.get(url)
    fuentes[url] = browser.page_source

browser.quit()

With this you already have the source for each URL:

print(fuentes['https://www.google.com'][:100])

This prints something like:

<html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="/images/b
    
answered by 03.08.2017 at 16:53
1

Of course you can. While I was at it, I made a few other changes to your code:

from selenium import webdriver
from bs4 import BeautifulSoup

with open("listado.txt", "r") as delFichero:

    browser = webdriver.Firefox()
    for linea in delFichero:
        url = linea.strip()
        browser.get(url)
        content = browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        print(url)

    browser.quit()

First of all, we use a context manager to handle reading the file, like this:

with open("listado.txt", "r") as delFichero:

This is much safer because it is easy to forget to close the file; in fact, your code never closes it. The context manager closes the file automatically as soon as the with block ends, even if an error occurs inside it.
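As a minimal sketch of that behavior, the snippet below checks the file's closed attribute inside and outside the with block. The temporary file and its two URLs are assumptions that stand in for listado.txt, and the names abierto_dentro and cerrado_fuera are only illustrative.

```python
import os
import tempfile

# Create a throwaway file standing in for listado.txt (illustrative only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("https://www.google.com\nhttps://duckduckgo.com\n")
    path = tmp.name

with open(path, "r") as delFichero:
    lineas = delFichero.readlines()
    abierto_dentro = not delFichero.closed  # still open inside the block

cerrado_fuera = delFichero.closed  # closed as soon as the block ends

os.remove(path)  # clean up the temporary file
```

No explicit close() call appears anywhere, yet the file is closed once the block is left.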

One detail: url = linea.strip() removes the line break that comes with each line read from the file. It may not be strictly necessary in your code, but it is a good habit.
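To see what strip() changes, here is a small sketch that uses io.StringIO as a stand-in for the real file (the sample contents, including the blank line, are assumed). It also skips empty lines, which is an extra precaution that the code above does not take.

```python
import io

# Simulated file contents; io.StringIO iterates line by line like a real file.
fichero_simulado = io.StringIO(
    "https://www.google.com\n"
    "\n"
    "  https://duckduckgo.com  \n"
)

crudas = []
urls = []
for linea in fichero_simulado:
    crudas.append(linea)   # raw line, still carrying its trailing "\n"
    url = linea.strip()    # drop surrounding whitespace and the newline
    if url:                # ignore lines that were blank
        urls.append(url)

print(repr(crudas[0]))  # 'https://www.google.com\n'
print(urls)             # ['https://www.google.com', 'https://duckduckgo.com']
```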

BeautifulSoup should be instantiated with an explicit parser: BeautifulSoup(content, "html.parser"). Without it, bs4 picks the best parser installed on the machine and emits a warning, so the result can differ from one system to another.

If you look at the new code, you will see that browser.quit() is now outside the loop, so the browser stays open until every URL has been processed.
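One caveat with a single shared browser: if any page raises an error, the quit() call after the loop is never reached and the browser window stays open. A possible sketch that guards against this with try/finally follows. The helper leer_fuentes and the class NavegadorFalso are hypothetical names; the fake browser only exists so the example runs without Selenium or Firefox installed.

```python
def leer_fuentes(browser, urls):
    """Load each URL in the same browser and return {url: page_source}.

    `browser` is any object offering get(), page_source and quit(),
    such as a Selenium WebDriver. This helper is hypothetical.
    """
    fuentes = {}
    try:
        for url in urls:
            browser.get(url)
            fuentes[url] = browser.page_source
    finally:
        # Runs whether the loop finished or raised, so no browser is left open.
        browser.quit()
    return fuentes


class NavegadorFalso:
    """Stand-in for a real driver so the sketch runs without Selenium."""

    def __init__(self):
        self.page_source = ""
        self.cerrado = False

    def get(self, url):
        self.page_source = "<html>" + url + "</html>"

    def quit(self):
        self.cerrado = True


navegador = NavegadorFalso()
fuentes = leer_fuentes(navegador, ["https://www.google.com", "https://duckduckgo.com"])
print(len(fuentes))       # 2
print(navegador.cerrado)  # True
```

With the real library you would pass webdriver.Firefox() as the browser argument instead of NavegadorFalso().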

    
answered by 03.08.2017 at 17:05