The key is to keep and reuse the session cookies between requests. A session implies maintaining certain state across multiple requests, which includes sending the cookies generated by the server with every request made within it.
Using urllib from the Python 3 stdlib, we have the http.cookiejar module, which together with urllib.request.OpenerDirector allows us to manage cookies. Keep in mind that this will always be more cumbersome than using tools with more abstraction, such as requests (see Advanced Usage - Session Objects) or scrapy.
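For comparison, here is a minimal sketch of the same idea with requests, assuming that library is installed; the form fields mirror the urllib example below, and a real StackOverflow login may require additional fields (e.g. a CSRF token), so treat it as an illustration of the pattern rather than a working login:

import requests

with requests.Session() as session:
    # Cookies set by the server are stored in session.cookies and
    # are sent automatically with every later request on this session.
    session.post('https://stackoverflow.com/users/login',
                 data={'email': 'email', 'password': 'password'})
    response = session.get('https://es.stackoverflow.com/')
    print(response.status_code)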
Below is a small example that authenticates on StackOverflow while keeping the session between requests; at the end, a request is made to the Spanish-language site and the HTML is saved to a file that you can open in a browser to quickly check the result.
import http.cookiejar
import urllib.parse
import urllib.request

EMAIL = "email"
PASSWORD = "password"
BASE_URL = 'https://stackoverflow.com/'
LOGIN_URL = 'https://stackoverflow.com/users/login'
ES_BASE_URL = "https://es.stackoverflow.com/"
USER_AGENT = 'Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'
HEADERS = {'User-Agent': USER_AGENT}

# Cookie jar shared by every request made through the installed opener.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)

values = {
    'email': EMAIL,
    'password': PASSWORD,
}
data = urllib.parse.urlencode(values).encode("utf-8")

# Log in: the response stores the session cookies in the jar.
req = urllib.request.Request(LOGIN_URL, data, HEADERS)
with urllib.request.urlopen(req) as response:
    html = response.read()

# Subsequent requests will maintain the session.
req = urllib.request.Request(ES_BASE_URL)
with urllib.request.urlopen(req) as response:
    html = response.read()

with open("so_es.html", "wb") as f:
    f.write(html)
If you comment out the line urllib.request.install_opener(opener), you can see that the session is no longer maintained between requests. The example is deliberately basic; if needed, the cookies can be stored on disk and loaded later to be reused.
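For instance, a minimal sketch of persisting cookies to disk using http.cookiejar.MozillaCookieJar (the file name cookies.txt is arbitrary):

import http.cookiejar
import urllib.request

COOKIES_FILE = "cookies.txt"  # arbitrary file name, Mozilla cookies.txt format

# First run: use a jar that can serialize itself to disk.
cj = http.cookiejar.MozillaCookieJar(COOKIES_FILE)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
# ... authenticate as in the example above ...
cj.save(ignore_discard=True)  # keep session cookies as well

# Later run: rebuild the jar from the file before installing the opener.
cj = http.cookiejar.MozillaCookieJar(COOKIES_FILE)
cj.load(ignore_discard=True)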