I have a question about scraping this site


I have little experience with Scrapy, although I have managed to scrape several pages. I am trying to scrape some retailers in my country to make it easier for people to compare product prices, and here is my problem.

The spider gets a 200 response and everything seems fine, but it does not follow the links into the product pages to scrape them. Here is my code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from paris.items import ParisItem


class MercadoSpider(CrawlSpider):
    name = 'paris'
    allowed_domains = ['www.paris.cl']
    start_urls = ['https://www.paris.cl/store/categoria/electro-television-todas-las-tv',
                  'https://www.paris.cl/store/categoria/electro-television-smart-tv',
                  'https://www.paris.cl/store/categoria/electro-television-ultrahd',
                  'https://www.paris.cl/store/categoria/electro-television-curvo-oled',
                  'https://www.paris.cl/store/categoria/electro-television-monitores',
                  'https://www.paris.cl/store/categoria/electro-accesorios-soportes-y-cables',
                  'https://www.paris.cl/store/categoria/electro-accesorios-tv-home-theater',
                  'https://www.paris.cl/store/categoria/electro-accesorios-tv-bluray-dvd']
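    # Scrapy documents `rules` as an ordered list/tuple of Rule objects;
    # a set literal ({...}) has no guaranteed order.
    rules = (
        # Follow the "load more products" pagination link
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@class,"load-more-products")]'))),
        # For each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="itemPromo"]')),
                            callback='parse_item',  follow=False),
    )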



    def parse_item(self, response):
        ml_item = ParisItem()
        ml_item['title1'] = response.xpath('normalize-space(//h1[@class="detalles-titulo"]/text())').extract_first()
        ml_item['precio_normal'] = response.xpath('normalize-space(//p[@class="precio_normal"])').extract_first()
        ml_item['precio_internet1'] = response.xpath('normalize-space(//p[@class="internet-price"])').extract_first()
        ml_item['precio_internet2'] = response.xpath('normalize-space(//p[@class="offerPrice"]/text()[1])').extract_first()
        ml_item['tarjeta'] = response.xpath('normalize-space(//p[@class="offerPrice"]/text()[1])').extract_first()
        ml_item['descuentos'] = response.xpath('normalize-space(//span[@class="discount"])').extract_first()
        ml_item['codigo'] = response.xpath('normalize-space(//*[@id="product"]/div[2]/div[4]/div/div[1]/div/p[1]/text()[2])').extract_first()
        ml_item['stock'] = response.xpath('normalize-space(//*[@class="cta-btn ajacAddToCart"]/text())').extract_first()
        ml_item['categoria'] = response.xpath('normalize-space(//*[@id="breadcrumb"]/a[1])').extract_first()
        yield ml_item


Here is the output:


2018-11-08 00:38:01 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: paris)
2018-11-08 00:38:01 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.12 (default, Dec  4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-4.15.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
2018-11-08 00:38:01 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'paris.spiders', 'SPIDER_MODULES': ['paris.spiders'], 'BOT_NAME': 'paris'}
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled item pipelines:
['paris.pipelines.ParisPipeline']
2018-11-08 00:38:01 [scrapy.core.engine] INFO: Spider opened
2018-11-08 00:38:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-08 00:38:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-ultrahd> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-smart-tv> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-curvo-oled> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-accesorios-tv-bluray-dvd> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-monitores> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-todas-las-tv> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-accesorios-soportes-y-cables> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-accesorios-tv-home-theater> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-08 00:38:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2059,
 'downloader/request_count': 8,
 'downloader/request_method_count/GET': 8,
 'downloader/response_bytes': 7356,
 'downloader/response_count': 8,
 'downloader/response_status_count/200': 8,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 11, 8, 3, 38, 1, 957792),
 'log_count/DEBUG': 9,
 'log_count/INFO': 7,
 'memusage/max': 52629504,
 'memusage/startup': 52629504,
 'response_received_count': 8,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2018, 11, 8, 3, 38, 1, 243399)}
2018-11-08 00:38:01 [scrapy.core.engine] INFO: Spider closed (finished)
    
asked by Raul Moreno Segura 08.11.2018 at 04:39

1 answer


The site loads all of its content dynamically with JavaScript, which Scrapy cannot handle directly. Open the browser's developer tools, go to the Network tab, filter by XHR, and look for the requests that return the content as JSON. Here is an example: link
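
As a rough sketch of that approach (not the site's actual API): once the JSON request has been found in the XHR tab, its body can be parsed with the standard `json` module instead of XPath. The payload shape, the key names (`products`, `name`, `normalPrice`, `internetPrice`) and the sample values below are all hypothetical placeholders; the real ones have to be read from the response shown in the browser.

```python
import json

# Hypothetical payload: the real structure and field names must be copied
# from the XHR response seen in the browser's Network tab.
SAMPLE_BODY = '''
{"products": [
  {"name": "TV LED 43", "normalPrice": "$299.990", "internetPrice": "$249.990"},
  {"name": "Soporte TV", "normalPrice": "$19.990"}
]}
'''


def parse_products(body):
    """Convert the JSON text returned by the endpoint into item dicts."""
    data = json.loads(body)
    items = []
    for product in data.get("products", []):
        items.append({
            # Keys on the left reuse the field names from ParisItem.
            "title1": product.get("name"),
            "precio_normal": product.get("normalPrice"),
            # .get() returns None when the product has no internet price.
            "precio_internet1": product.get("internetPrice"),
        })
    return items


items = parse_products(SAMPLE_BODY)
```

Inside a spider, the callback of a `scrapy.Request` sent to the JSON URL could then loop over `parse_products(response.text)` and yield each item, so no link extraction from the empty HTML is needed.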

    
answered 13.11.2018 at 02:50