Scrapy Python: Do a recursive search on a page


I'm trying to do a recursive search on a web page with Scrapy. I have set DEPTH_LIMIT = 4 in settings.py, and my code is as follows:

import re

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class HreflocalizeSpider(scrapy.Spider):
    name = "hrefLocalize"
    allowed_domains = [URL]
    start_urls = (
        'URL_DE_BUSQUEDA',
    )
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse', follow=True),
    )
    settings.overrides['DEPTH_LIMIT'] = 4  # I added this to force the change
    settings.overrides['DEPTH_PRIORITY'] = 4

    def parse(self, response):
        hxs = scrapy.Selector(response)
        lines = hxs.xpath("//@href").extract()
        # Match absolute ftp/http/https URLs only
        linkPattern = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")
        for line in lines:
            print(line)
            if linkPattern.match(line):
                yield Request(line, self.parse)
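
For reference, the depth change in settings.py is just these lines (assuming the default project layout that scrapy startproject generates):

# settings.py -- project-wide configuration
DEPTH_LIMIT = 4      # maximum link depth to follow below the start URLs
DEPTH_PRIORITY = 4   # adjusts request priority by depth (positive values favor breadth-first order)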

But even with all this, the crawl stats always report:

'request_depth_max': 1

I have seen in the logs that the middleware that handles the depth search is loaded, but even so, it does not crawl any deeper.
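
For reference, here is a minimal sketch of what I understand the setup should look like in Scrapy 1.x, where rules are only honored by CrawlSpider, settings.overrides no longer exists, and per-spider overrides go in custom_settings (the spider name and callback below are made up for illustration); I am not sure whether this is the part I am missing:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DepthExampleSpider(CrawlSpider):
    # Hypothetical spider, for illustration only
    name = "depthExample"
    start_urls = ['URL_DE_BUSQUEDA']

    # Per-spider settings, applied before the crawl starts
    custom_settings = {
        'DEPTH_LIMIT': 4,
    }

    rules = (
        # follow=True makes CrawlSpider keep extracting links recursively;
        # the callback must not be named 'parse', because CrawlSpider
        # uses parse() internally to drive the rules
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}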

Could someone help me out and tell me what I'm doing wrong?

Thank you very much in advance!

    
asked by Jose Vila 30.04.2016 at 21:04

0 answers