How can I scrape a web page that uses JavaScript with Python 3?

Question


Hi, I would like to know how to scrape a web page that uses JavaScript, using PyQt5. The page I want to extract information from is the one assigned to url in the code below.

From this page I want to take the name of the series and the chapters. This is what I have so far:

import sys
from bs4 import BeautifulSoup
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
from urllib.request import urlopen


class Render(QWebPage):

    def __init__(self, html):
        self.html = None
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().setHtml(html)
        self.app.exec_()

    def _loadFinished(self, result):
        self.html = self.mainFrame().toHtml()
        self.app.quit()


url = 'https://www.tumangaonline.com/biblioteca/mangas/22954/Sonomono-Nochi-Ni'

source_html = urlopen(url)

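# setHtml() expects a str, so decode the response bytes (assumes the page is UTF-8)
rendered_html = Render(source_html.read().decode('utf-8')).html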

soup = BeautifulSoup(rendered_html, 'lxml')

p = soup.find_all('a')

print('title is %r' % soup.select_one('title').text)
print(p)

python python-3.x

asked by johni 16.08.2017 в 20:44


1 answer


score 1 · Answer 1

I have not tried your code, but I have a fair amount of experience scraping with BeautifulSoup, and your approach looks reasonable.

The problem is that when a page is dynamic and builds its content with JavaScript, the information is not yet in the HTML at the moment you request the page.
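
As a small illustration of this point, the sketch below parses a made-up HTML document with the standard library's html.parser. The markup, the "chapters" id, and the LinkCounter helper are all invented for this example. The chapter link exists only inside the script text, so a parser that does not execute JavaScript finds no links at all. BeautifulSoup does not execute scripts either, so it would be working from the same empty container.

```python
from html.parser import HTMLParser

# Hypothetical page source, roughly what urlopen() might return for a
# dynamic page: the chapter list stays empty until a browser runs the script.
RAW_HTML = """
<html>
  <head><title>Example series</title></head>
  <body>
    <div id="chapters"></div>
    <script>
      document.getElementById("chapters").innerHTML =
        '<a href="/chapter/1">Chapter 1</a>';
    </script>
  </body>
</html>
"""


class LinkCounter(HTMLParser):
    """Counts <a> start tags found in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


counter = LinkCounter()
counter.feed(RAW_HTML)

# The only link lives inside the <script> text, which is never executed,
# so no <a> element is found.
print(counter.links)
```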

To solve that, the first thing I usually do is use Postman Interceptor to capture the requests made while I browse, as usual, to the page that shows the information.

After that, identify the JavaScript (AJAX) request that returns the information you want. Once you have identified it, replicate it yourself in Python with either urlopen or requests.
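
Replicating such a request could look roughly like the sketch below, which uses only urllib from the standard library. The endpoint in API_URL, the series_id and page parameters, and the header values are assumptions for illustration; the real ones have to be copied from the request you captured. The build_api_request and fetch_json helpers are likewise hypothetical names, not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace it with the URL of the request you
# actually observed while intercepting the page's traffic.
API_URL = "https://example.com/api/series/chapters"


def build_api_request(base_url, params):
    """Build a GET request that mimics a browser's AJAX call."""
    full_url = base_url + "?" + urlencode(params)
    headers = {
        # Some sites check these headers; copy the real values from
        # the captured request rather than trusting these placeholders.
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(full_url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the body as JSON (assumes a JSON reply)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical query parameters, shown only to illustrate the shape of the call.
request = build_api_request(API_URL, {"series_id": 22954, "page": 1})
print(request.full_url)

# Uncomment once API_URL points at a real endpoint:
# data = fetch_json(request)
```

Keeping the request construction separate from the network call makes it easy to compare the generated URL and headers against the captured request before sending anything.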

Once you get the response, extract the information from it with BeautifulSoup or another parser.
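
If the intercepted request turns out to return JSON rather than HTML, the standard json module may be all you need. The sketch below assumes a made-up payload: the "series", "name", "chapters" and "title" fields and the extract_series_info helper are hypothetical, and the real field names depend on what the site actually sends back.

```python
import json

# Hypothetical response body; the real structure depends on the site's API.
SAMPLE_RESPONSE = """
{
  "series": {"name": "Example series"},
  "chapters": [
    {"number": 1, "title": "Chapter 1"},
    {"number": 2, "title": "Chapter 2"}
  ]
}
"""


def extract_series_info(payload):
    """Return the series name and the list of chapter titles from a decoded payload."""
    name = payload.get("series", {}).get("name")
    chapters = [chapter.get("title") for chapter in payload.get("chapters", [])]
    return name, chapters


name, chapters = extract_series_info(json.loads(SAMPLE_RESPONSE))
print(name)
print(chapters)
```

If the response is an HTML fragment instead, the same idea applies with BeautifulSoup: pass the response text to it and select the elements you need.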