I want to extract the full URLs of the links that appear on the front page of a news site, but the links are relative.
The site is http://www.puntal.com.ar/v2/
The links look like this:
<div class="article-title">
<a href="/v2/article.php?id=187222">Barros Schelotto: "No somos River y vamos a tratar de pasar a la final"</a>
</div>
The extracted link would then be "/v2/article.php?id=187222".
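Extracting such an href and joining it against the site root can be sketched with just the standard library (the HTML snippet is the one from the question; `xml.etree` stands in for Scrapy's selector here purely for illustration):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

# The snippet from the front page (well-formed, so ElementTree can parse it)
html = ('<div class="article-title">'
        '<a href="/v2/article.php?id=187222">Barros Schelotto</a>'
        '</div>')

div = ET.fromstring(html)
href = div.find('a').get('href')  # the relative link
full = urljoin('http://www.puntal.com.ar', href)
print(full)  # http://www.puntal.com.ar/v2/article.php?id=187222
```

`urljoin` handles the leading slash correctly, replacing any path on the base URL with the absolute path from the href.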
My code is as follows:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.request import Request

try:
    from urllib.parse import urljoin  # Python 3.x
except ImportError:
    from urlparse import urljoin  # Python 2.7

from puntalcomar.items import PuntalcomarItem


class PuntalComArSpider(CrawlSpider):
    name = 'puntal.com.ar'
    allowed_domains = ['http://www.puntal.com.ar/v2/']
    start_urls = ['http://www.puntal.com.ar/v2/']

    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_url", follow=True),
    )

    def parse_url(self, response):
        hxs = Selector(response)
        urls = hxs.xpath('//div[@class="article-title"]/a/@href').extract()
        print 'enlace relativo ', urls

        for url in urls:
            urlfull = urljoin('http://www.puntal.com.ar', url)
            print 'enlace completo ', urlfull
            yield Request(urlfull, callback=self.parse_item)

    def parse_item(self, response):
        hxs = Selector(response)
        dates = hxs.xpath('//span[@class="date"]')
        title = hxs.xpath('//div[@class="title"]')
        subheader = hxs.xpath('//div[@class="subheader"]')
        body = hxs.xpath('//div[@class="body"]/p')
        items = []

        for date in dates:
            item = PuntalcomarItem()
            item["date"] = date.xpath('text()').extract()
            item["title"] = title.xpath("text()").extract()
            item["subheader"] = subheader.xpath('text()').extract()
            item["body"] = body.xpath("text()").extract()
            items.append(item)

        return items
But it does not work.
I am using Linux Mint with Python 2.7.6.
Scrapy console output:
$ scrapy crawl puntal.com.ar
2016-07-10 13:39:15 [scrapy] INFO: Scrapy 1.1.0 started (bot: puntalcomar)
2016-07-10 13:39:15 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'puntalcomar.spiders', 'SPIDER_MODULES': ['puntalcomar.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'puntalcomar'}
2016-07-10 13:39:15 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-10 13:39:15 [scrapy] INFO: Enabled item pipelines:
['puntalcomar.pipelines.XmlExportPipeline']
2016-07-10 13:39:15 [scrapy] INFO: Spider opened
2016-07-10 13:39:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (404) <GET http://www.puntal.com.ar/robots.txt> (referer: None)
2016-07-10 13:39:15 [scrapy] DEBUG: Redirecting (301) to <GET http://www.puntal.com.ar/v2/> from <GET http://www.puntal.com.ar/v2>
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (200) <GET http://www.puntal.com.ar/v2/> (referer: None)
enlace relativo [u'/v2/article.php?id=187334', u'/v2/article.php?id=187324', u'/v2/article.php?id=187321', u'/v2/article.php?id=187316', u'/v2/article.php?id=187335', u'/v2/article.php?id=187308', u'/v2/article.php?id=187314', u'/v2/article.php?id=187315', u'/v2/article.php?id=187317', u'/v2/article.php?id=187319', u'/v2/article.php?id=187310', u'/v2/article.php?id=187298', u'/v2/article.php?id=187300', u'/v2/article.php?id=187299', u'/v2/article.php?id=187306', u'/v2/article.php?id=187305']
enlace completo http://www.puntal.com.ar/v2/article.php?id=187334
2016-07-10 13:39:15 [scrapy] DEBUG: Filtered offsite request to 'www.puntal.com.ar': <GET http://www.puntal.com.ar/v2/article.php?id=187334>
enlace completo http://www.puntal.com.ar/v2/article.php?id=187324
enlace completo http://www.puntal.com.ar/v2/article.php?id=187321
enlace completo http://www.puntal.com.ar/v2/article.php?id=187316
enlace completo http://www.puntal.com.ar/v2/article.php?id=187335
enlace completo http://www.puntal.com.ar/v2/article.php?id=187308
enlace completo http://www.puntal.com.ar/v2/article.php?id=187314
enlace completo http://www.puntal.com.ar/v2/article.php?id=187315
enlace completo http://www.puntal.com.ar/v2/article.php?id=187317
enlace completo http://www.puntal.com.ar/v2/article.php?id=187319
enlace completo http://www.puntal.com.ar/v2/article.php?id=187310
enlace completo http://www.puntal.com.ar/v2/article.php?id=187298
enlace completo http://www.puntal.com.ar/v2/article.php?id=187300
enlace completo http://www.puntal.com.ar/v2/article.php?id=187299
enlace completo http://www.puntal.com.ar/v2/article.php?id=187306
enlace completo http://www.puntal.com.ar/v2/article.php?id=187305
2016-07-10 13:39:15 [scrapy] INFO: Closing spider (finished)
2016-07-10 13:39:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 50497,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 726952),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 16,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 121104)}
2016-07-10 13:39:15 [scrapy] INFO: Spider closed (finished)
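The DEBUG line "Filtered offsite request" comes from Scrapy's OffsiteMiddleware, which compares each request's hostname against the entries in `allowed_domains`. A simplified sketch of that check (an assumption for illustration; Scrapy's real implementation lives in `scrapy.utils.url.url_is_from_any_domain`) shows why an entry containing a scheme and path can never match a hostname:

```python
from urllib.parse import urlparse  # urlparse.urlparse on Python 2

def is_offsite(url, allowed_domains):
    # Simplified offsite check: the request's hostname must equal an
    # allowed domain or be a subdomain of it.
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

url = 'http://www.puntal.com.ar/v2/article.php?id=187334'

# A full URL in allowed_domains never equals a bare hostname -> filtered
print(is_offsite(url, ['http://www.puntal.com.ar/v2/']))  # True

# A bare domain matches -> the request goes through
print(is_offsite(url, ['www.puntal.com.ar']))             # False
```

This is consistent with the `'offsite/filtered': 16` entry in the stats dump: all sixteen article requests were dropped before being downloaded.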
Solution:
The configuration for concurrent requests was missing from settings.py:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16
Moral: "Do not take anything for granted."