Load different URLs without closing the browser

Question

With this code I load different URLs in the browser to extract their page source, but the browser is closed after each page is read. Can I load the next URL from the list in the same browser instead?

from selenium import webdriver
from bs4 import BeautifulSoup

delFichero = file('listado.txt', 'r')

for n in delFichero:

    URL = str(n)

    browser = webdriver.Firefox()
    browser.get(URL)
    content = browser.page_source

    soup = BeautifulSoup(content)

    browser.quit()

python selenium beautifulsoup

asked by chikilicuatre 03.08.2017 в 12:53

2 answers

score 1 · Answer 1

You only need to change your code slightly: create the browser once before the loop and close it once after the loop. Something like this:

from selenium import webdriver


urls = [
    'https://www.google.com',
    'https://duckduckgo.com'
]
browser = webdriver.Firefox()
fuentes = {}

for url in urls:
    browser.get(url)
    fuentes[url] = browser.page_source

browser.quit()

This gives you the source of every page, keyed by its URL:

print(fuentes['https://www.google.com'][:100])

This prints something like:

<html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="/images/b

score 1 · Answer 2

Of course you can. Here is your code with a few other changes made along the way:

from selenium import webdriver
from bs4 import BeautifulSoup

with open("listado.txt", "r") as delFichero:

    browser = webdriver.Firefox()
    for linea in delFichero:
        url = linea.strip()
        browser.get(url)
        content = browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        print(url)

    browser.quit()

First, we use a context manager to handle reading the file:

with open("listado.txt", "r") as delFichero:

This is much safer because it is easy to forget to close a file; in fact, your code never closes it. The context manager closes the file automatically as soon as execution leaves the with block, even if an exception is raised inside it.
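
To see the automatic close in action, here is a minimal sketch. It writes a temporary file that plays the role of listado.txt (the contents are made up for illustration) and checks the file object's closed attribute inside and after the with block.

```python
import os
import tempfile

# Create a throwaway file that stands in for listado.txt (illustrative content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("https://www.google.com\nhttps://duckduckgo.com\n")
    path = tmp.name

with open(path, "r") as delFichero:
    open_inside = not delFichero.closed  # the file is open inside the block
    lines = delFichero.readlines()

closed_after = delFichero.closed         # the file was closed on leaving the block

os.remove(path)  # clean up the temporary file
print(open_inside, closed_after, len(lines))
```

No explicit close() call appears anywhere, yet the file is closed once the block ends.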

One detail: url = linea.strip() removes the line break (and any surrounding whitespace) from each line read from the file. Your original code appears to work without it, but cleaning the input this way is a good habit.
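
As a small illustration of what strip() does to a line read from the file, the sketch below uses hard-coded strings in place of the file contents (the sample lines are assumptions). It also shows that blank lines turn into empty strings, which makes them easy to skip before calling browser.get().

```python
# A line as it comes out of the file iterator, including the trailing newline.
linea = "https://duckduckgo.com\n"

url = linea.strip()

print(repr(linea))  # 'https://duckduckgo.com\n'
print(repr(url))    # 'https://duckduckgo.com'

# Blank lines become empty strings after strip(), so they can be filtered out.
lineas = ["https://www.google.com\n", "\n", "  https://duckduckgo.com  \n"]
urls = [l.strip() for l in lineas if l.strip()]
print(urls)  # ['https://www.google.com', 'https://duckduckgo.com']
```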

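BeautifulSoup should be instantiated with an explicit parser, like this: BeautifulSoup(content, "html.parser"). If you omit the parser, bs4 picks the best one installed on the machine and emits a warning, so the same code can behave differently on another system.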

If you look at the new code, you will see that browser.quit() is now outside the loop, so the browser stays open until every URL has been loaded.
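
One case the code in this answer does not cover: if browser.get() raises partway through the list, browser.quit() is never reached and the Firefox window stays open. A minimal sketch of one way to guard against that with try/finally follows. The names fetch_sources, make_browser and FakeBrowser are hypothetical and not part of Selenium; FakeBrowser only stands in for webdriver.Firefox so that the sketch runs without a real browser.

```python
def fetch_sources(urls, make_browser):
    """Load every URL in one browser session and return {url: page source}."""
    browser = make_browser()           # opened once, before the loop
    sources = {}
    try:
        for url in urls:
            browser.get(url)           # reuses the same browser window
            sources[url] = browser.page_source
    finally:
        browser.quit()                 # runs once, even if get() raised
    return sources


class FakeBrowser(object):
    """Hypothetical stand-in for webdriver.Firefox, used only for this demo."""
    instances = 0

    def __init__(self):
        FakeBrowser.instances += 1
        self.page_source = ""
        self.quit_calls = 0

    def get(self, url):
        # Pretend the page source is just the URL wrapped in an html tag.
        self.page_source = "<html>" + url + "</html>"

    def quit(self):
        self.quit_calls += 1


fuentes = fetch_sources(
    ["https://www.google.com", "https://duckduckgo.com"], FakeBrowser
)
print(sorted(fuentes))
print(FakeBrowser.instances)
```

With Selenium installed, the equivalent call would be fetch_sources(urls, webdriver.Firefox), since the function only relies on get(), page_source and quit().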