Problem with relative links in Scrapy [closed]


I want to fetch the full news stories from the links that appear on the front page of a news site, but the links are relative.

The site is http://www.puntal.com.ar/v2/

And the links look like this

<div class="article-title">
    <a href="/v2/article.php?id=187222">Barros Schelotto: "No somos River y vamos a tratar de pasar a la final"</a>
</div>

The link would then be "/v2/article.php?id=187222".
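
For reference, this is a minimal sketch of how a relative href like that one can be resolved against the base URL with the standard library's urljoin (the base URL is the one from my start_urls):

try:
    from urllib.parse import urljoin  # Python 3.x
except ImportError:
    from urlparse import urljoin      # Python 2.7

# Resolve the relative href against the site's base URL
relative = '/v2/article.php?id=187222'
print(urljoin('http://www.puntal.com.ar/v2/', relative))
# http://www.puntal.com.ar/v2/article.php?id=187222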

My code is as follows:

# -*- coding: utf-8 -*-

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.request import Request
try:
    from urllib.parse import urljoin # Python3.x
except ImportError:
    from urlparse import urljoin # Python2.7

from puntalcomar.items import PuntalcomarItem


class PuntalComArSpider(CrawlSpider):
    name = 'puntal.com.ar'
    allowed_domains = ['http://www.puntal.com.ar/v2/']
    start_urls = ['http://www.puntal.com.ar/v2/']

    rules = (
            Rule(LinkExtractor(allow=(''),), callback="parse_url", follow=True),
        )

    def parse_url(self, response):
        hxs = Selector(response)
        urls = hxs.xpath('//div[@class="article-title"]/a/@href').extract()
        print 'enlace relativo ', urls
        for url in urls:
           urlfull = urljoin('http://www.puntal.com.ar',url)
           print 'enlace completo ', urlfull
           yield Request(urlfull, callback = self.parse_item)

    def parse_item(self, response):
        hxs = Selector(response)
        dates = hxs.xpath('//span[@class="date"]')
        title = hxs.xpath('//div[@class="title"]')
        subheader = hxs.xpath('//div[@class="subheader"]')
        body = hxs.xpath('//div[@class="body"]/p')
        items = []
        for date in dates:
            item =  PuntalcomarItem()
            item["date"] = date.xpath('text()').extract()
            item["title"] = title.xpath("text()").extract()
            item["subheader"] = subheader.xpath('text()').extract()
            item["body"] = body.xpath("text()").extract()
            items.append(item)
        return items

But it does not work.

I use Linux Mint with Python 2.7.6

Scrapy console output:

$ scrapy crawl puntal.com.ar
2016-07-10 13:39:15 [scrapy] INFO: Scrapy 1.1.0 started (bot: puntalcomar)
2016-07-10 13:39:15 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'puntalcomar.spiders', 'SPIDER_MODULES': ['puntalcomar.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'puntalcomar'}
2016-07-10 13:39:15 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-10 13:39:15 [scrapy] INFO: Enabled item pipelines:
['puntalcomar.pipelines.XmlExportPipeline']
2016-07-10 13:39:15 [scrapy] INFO: Spider opened
2016-07-10 13:39:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (404) <GET http://www.puntal.com.ar/robots.txt> (referer: None)
2016-07-10 13:39:15 [scrapy] DEBUG: Redirecting (301) to <GET http://www.puntal.com.ar/v2/> from <GET http://www.puntal.com.ar/v2>
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (200) <GET http://www.puntal.com.ar/v2/> (referer: None)
enlace relativo  [u'/v2/article.php?id=187334', u'/v2/article.php?id=187324', u'/v2/article.php?id=187321', u'/v2/article.php?id=187316', u'/v2/article.php?id=187335', u'/v2/article.php?id=187308', u'/v2/article.php?id=187314', u'/v2/article.php?id=187315', u'/v2/article.php?id=187317', u'/v2/article.php?id=187319', u'/v2/article.php?id=187310', u'/v2/article.php?id=187298', u'/v2/article.php?id=187300', u'/v2/article.php?id=187299', u'/v2/article.php?id=187306', u'/v2/article.php?id=187305']
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187334
2016-07-10 13:39:15 [scrapy] DEBUG: Filtered offsite request to 'www.puntal.com.ar': <GET http://www.puntal.com.ar/v2/article.php?id=187334>
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187324
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187321
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187316
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187335
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187308
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187314
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187315
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187317
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187319
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187310
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187298
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187300
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187299
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187306
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187305
2016-07-10 13:39:15 [scrapy] INFO: Closing spider (finished)
2016-07-10 13:39:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 50497,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 726952),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 16,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 121104)}
2016-07-10 13:39:15 [scrapy] INFO: Spider closed (finished)

Solution:

The concurrent-requests configuration was missing from settings.py:

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

Moral: 'Do not take anything for granted'

asked by dedio on 10.07.2016 at 01:26

1 answer


From what you describe, it is not clear where the error is, and you do not show the complete error message. I assume the problem is in the parse_url method. See if the following works for you:

try:
    from urllib.parse import urljoin # Python3.x
except ImportError:
    from urlparse import urljoin # Python2.7

# ...

class PuntalComArSpider(CrawlSpider):
    # ...

    def parse_url(self, response):
        hxs = Selector(response)
        urls = hxs.xpath('//div[@class="article-title"]/a/@href').extract()
        for url in urls:
           #################################################
           # The following line has been adapted
           urlfull = urljoin('http://www.puntal.com.ar',url)
           #################################################
           yield Request(urlfull, callback = self.parse_item)
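
As an alternative (a small sketch, assuming Scrapy 1.0 or newer), the response object itself provides a urljoin() helper that resolves relative links against the response URL, so you do not need to hard-code the base:

    def parse_url(self, response):
        # response.urljoin() resolves each relative href against response.url
        urls = response.xpath('//div[@class="article-title"]/a/@href').extract()
        for url in urls:
            yield Request(response.urljoin(url), callback=self.parse_item)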

If it still does not work, please edit your question to make it clearer where the error is coming from.

answered on 10.07.2016 at 11:05