Horizontal and vertical web scraping with Scrapy


I'm new to scraping and I'm doing horizontal and vertical scraping with Scrapy. When I run my code it generates the .csv file, but it is empty, without the scraped information. This is my code. Can someone tell me what's wrong?

__author__ = 'jesus toxort'

from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.selector import Selector

class AirbnbItem(Item):
    regla = Field()
    id = Field()

class AirbnbCrawler(CrawlSpider):
    name = "CrawlerAirbnb"
    start_urls = ["https://www.airbnb.com/s/Londres--Reino-Unido"]
    allowed_domains = ['airbnb.com']
    rules = (
            Rule(LinkExtractor(allow=r'offset=')),
            Rule(LinkExtractor(allow=r'/rooms'), callback = 'parse_items'),
            )
    def parse_items(self, response):
        sel = Selector(response)
        reglas = sel.xpath('//*[@id="house-rules"]/div/section')
        for i, elem in enumerate(reglas):
            item = ItemLoader(AirbnbItem(), elem)
            item.add_xpath('regla', './/div/div/text()')
            item.add_value('id', i)
            yield item.load_item()
    
asked by JesusToxqui 26.05.2018 at 01:11

1 answer


If, as @fredyfx says, the page gets its information with POST requests executed from JavaScript, then the elements you are looking for with your XPath expressions simply do not exist in the downloaded HTML yet: they are created by JavaScript once the responses to those requests arrive.

This is easy to check. Download the page:

$ wget https://www.airbnb.com/s/Londres--Reino-Unido

Then check whether it contains the string "/rooms", which is what your scraper uses to extract links:

$ grep /rooms Londres--Reino-Unido

Nothing comes out: the string does not appear in the page. If instead you open the page in a browser and use the "inspect element" tool, you will see HTML elements such as:

<meta itemprop="url" content="www.airbnb.es/rooms/17247557?location=Londres%2C%20Reino%20Unido">

These elements were not in the downloaded HTML; they appear there as the result of JavaScript executed by the browser, following instructions contained in the downloaded document itself or in external scripts linked from it.
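To illustrate the difference: once the JavaScript has run, the rendered HTML does contain such elements, and the room URLs can be pulled out of it even with the standard library's parser. This is a minimal sketch, fed with the single `<meta>` tag shown above as a stand-in for the full rendered page:

```python
from html.parser import HTMLParser

class RoomLinkParser(HTMLParser):
    """Collect the content of <meta itemprop="url"> tags that point at /rooms."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta" and attrs.get("itemprop") == "url"
                and "/rooms" in attrs.get("content", "")):
            self.links.append(attrs["content"])

# Stand-in for the browser-rendered HTML (the tag shown above):
rendered = ('<meta itemprop="url" content='
            '"www.airbnb.es/rooms/17247557?location=Londres%2C%20Reino%20Unido">')

parser = RoomLinkParser()
parser.feed(rendered)
print(parser.links)
```

Feeding the raw, pre-JavaScript HTML to the same parser would yield an empty list, which is exactly what the grep above shows.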

Therefore, in short, to scrape dynamically generated pages you need to execute the corresponding JavaScript, which requires a real browser, because Python by itself cannot execute JavaScript.

Although this is a major inconvenience, it is not impossible. Many modern browsers can be "driven" from a script. So Python can launch a browser, tell it to load the page (the browser runs the JavaScript and generates the dynamic content), and then retrieve the generated HTML, or even simulate user actions such as "click on this button" or "move the mouse over that image".

One Python package that lets you do these things is requests-html. Unfortunately I do not see how to integrate it into Scrapy; it is geared toward downloading a single page and scraping it, not toward building spiders that keep following automatically discovered links.
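I have not run this against the page in question, but based on requests-html's documented API (`HTMLSession`, `render()`, `absolute_links`), a single-page version of your scraper could be sketched like this. Note that the first call to `render()` downloads a headless Chromium build, so it needs network access:

```python
def fetch_room_links(url):
    """Download a page, execute its JavaScript, and return its /rooms links."""
    # Lazy import so the function can be defined without the package installed:
    # pip install requests-html
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # launches headless Chromium and runs the page's JavaScript
    return sorted(link for link in r.html.absolute_links if "/rooms" in link)

if __name__ == "__main__":
    for link in fetch_room_links("https://www.airbnb.com/s/Londres--Reino-Unido"):
        print(link)
```

After `render()`, the links that were missing from the raw download should be present, because they are extracted from the JavaScript-generated DOM rather than from the original HTML.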

Another option, which works on a different principle, is scrapy-splash. It integrates Splash so it can be used from Scrapy. Splash is a piece of software that runs as a server and acts as a proxy between Scrapy and the real server: it downloads the page, executes the JavaScript, and serves the result to Scrapy.

I have not used it and cannot tell you how well it works, but a priori it looks harder to install, because the method recommended in the manual is to run Splash in a Docker container...
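For reference, the configuration described in the scrapy-splash README amounts to roughly these additions to settings.py (assuming a Splash instance listening on localhost:8050, e.g. started with `docker run -p 8050:8050 scrapinghub/splash`):

```python
# settings.py — additions taken from the scrapy-splash README.
# Address of the Splash instance (here: the Docker container mapped to 8050):
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dedupe filter that understands Splash request arguments:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

The spider then yields `scrapy_splash.SplashRequest(url, callback, args={'wait': 1})` instead of a plain `Request`, and the callback receives the HTML after the JavaScript has been executed.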

    
answered by 26.05.2018 / 23:18