I have little experience with Scrapy, although I have managed to scrape several pages. I am trying to scrape a retailer in my country to make it easier for people to compare product prices, and here is my problem.
Every request returns a 200 response and everything looks fine, but the spider never goes into the product pages to scrape them. Here is my code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from paris.items import ParisItem


class MercadoSpider(CrawlSpider):
    name = 'paris'
    allowed_domains = ['www.paris.cl']
    start_urls = ['https://www.paris.cl/store/categoria/electro-television-todas-las-tv',
                  'https://www.paris.cl/store/categoria/electro-television-smart-tv',
                  'https://www.paris.cl/store/categoria/electro-television-ultrahd',
                  'https://www.paris.cl/store/categoria/electro-television-curvo-oled',
                  'https://www.paris.cl/store/categoria/electro-television-monitores',
                  'https://www.paris.cl/store/categoria/electro-accesorios-soportes-y-cables',
                  'https://www.paris.cl/store/categoria/electro-accesorios-tv-home-theater',
                  'https://www.paris.cl/store/categoria/electro-accesorios-tv-bluray-dvd']

    # A tuple keeps the rules in a fixed order (a set does not).
    rules = (
        # Follow the "load more products" links
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@class,"load-more-products")]'))),
        # For each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="itemPromo"]')),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        ml_item = ParisItem()
        ml_item['title1'] = response.xpath('normalize-space(//h1[@class="detalles-titulo"]/text())').extract_first()
        ml_item['precio_normal'] = response.xpath('normalize-space(//p[@class="precio_normal"])').extract_first()
        ml_item['precio_internet1'] = response.xpath('normalize-space(//p[@class="internet-price"])').extract_first()
        ml_item['precio_internet2'] = response.xpath('normalize-space(//p[@class="offerPrice"]/text()[1])').extract_first()
        # Note: same XPath as precio_internet2
        ml_item['tarjeta'] = response.xpath('normalize-space(//p[@class="offerPrice"]/text()[1])').extract_first()
        ml_item['descuentos'] = response.xpath('normalize-space(//span[@class="discount"])').extract_first()
        ml_item['codigo'] = response.xpath('normalize-space(//*[@id="product"]/div[2]/div[4]/div/div[1]/div/p[1]/text()[2])').extract_first()
        ml_item['stock'] = response.xpath('normalize-space(//*[@class="cta-btn ajacAddToCart"]/text())').extract_first()
        ml_item['categoria'] = response.xpath('normalize-space(//*[@id="breadcrumb"]/a[1])').extract_first()
        yield ml_item
Here is the log output from running the spider:
2018-11-08 00:38:01 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: paris)
2018-11-08 00:38:01 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.15.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
2018-11-08 00:38:01 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'paris.spiders', 'SPIDER_MODULES': ['paris.spiders'], 'BOT_NAME': 'paris'}
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-08 00:38:01 [scrapy.middleware] INFO: Enabled item pipelines:
['paris.pipelines.ParisPipeline']
2018-11-08 00:38:01 [scrapy.core.engine] INFO: Spider opened
2018-11-08 00:38:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-08 00:38:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-ultrahd> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-smart-tv> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-curvo-oled> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-accesorios-tv-bluray-dvd> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-monitores> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-television-todas-las-tv> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-accesorios-soportes-y-cables> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paris.cl/store/categoria/electro-accesorios-tv-home-theater> (referer: None)
2018-11-08 00:38:01 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-08 00:38:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2059,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 7356,
'downloader/response_count': 8,
'downloader/response_status_count/200': 8,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 8, 3, 38, 1, 957792),
'log_count/DEBUG': 9,
'log_count/INFO': 7,
'memusage/max': 52629504,
'memusage/startup': 52629504,
'response_received_count': 8,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'start_time': datetime.datetime(2018, 11, 8, 3, 38, 1, 243399)}
2018-11-08 00:38:01 [scrapy.core.engine] INFO: Spider closed (finished)
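
One thing that stands out in the stats dump is how small the responses are. The sketch below just redoes the arithmetic with the numbers copied from the log; `average_response_bytes` is a hypothetical helper, and as far as I understand `downloader/response_bytes` also counts headers and may reflect the compressed body, so this is a rough indicator only.

```python
# Values copied from the stats dump in the log above.
stats = {
    'downloader/request_count': 8,
    'downloader/response_bytes': 7356,
    'downloader/response_count': 8,
}

NUM_START_URLS = 8  # length of start_urls in the spider


def average_response_bytes(stats):
    """Average size per response, or 0.0 if nothing was downloaded."""
    count = stats['downloader/response_count']
    if not count:
        return 0.0
    return stats['downloader/response_bytes'] / float(count)


avg = average_response_bytes(stats)
# Requests beyond the start URLs would mean some rule extracted a link.
followed = stats['downloader/request_count'] - NUM_START_URLS

print(avg)
print(followed)
```

Under 1 KB per response seems very little for a category page full of products, and no request was made beyond the eight start URLs, so it may be that the server is returning a stub page (or the listing is loaded later by JavaScript) rather than the HTML my XPath expressions expect.
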