Format the scrapy spider response.css result

Question

When I run my spider, I save the values I extract in a dictionary, but I also want to create a folder named after one of the results.

def parse(self, response):

    ml_item = ScrapyItem()
    mt_item = ScrapyItem()

    mt_item['title'] = response.css('div.info h1::text').extract()
    mt_item['Parodies'] = response.css('span.characters 

    name = str(mt_item['title'])
    os.mkdir(name)

The problem is that the folder name is saved as [u"Winter's Tale"].

How can I format it so that only the words remain?

python django scrapy

asked by valentin rodriguez 23.06.2017 at 06:37

1 answer

score 0 · Accepted Answer

Here is a routine I use in these cases. It works with Unicode strings and replaces several characters that are not compatible with file systems.

import re

from unicodedata import normalize

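# Separator characters; the closing bracket is escaped and the backtick restored so the pattern compiles.
_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')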

def normalize_filename(text, delim='-'):
    """Normalize a string so it can be used as a file name.

    Args:
        text (str): String to normalize
        delim (str): Replacement for characters that are not valid

    Example:
        >>> normalize_filename(u"Esto, no es válido como nombre de Archivo!", "-")
        'esto-no-es-valido-como-nombre-de-archivo'
    """
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        word = word.decode('utf-8')
        if word:
            result.append(word)
    return delim.join(result)

In _punct_re we define the regular expression for the characters to clean up. They are treated as separators: the text is split on them, and the remaining words are joined with the value of the delim parameter.
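To make the splitting step concrete, here is a minimal sketch that runs only the separator pattern on the title from the question. The pattern is repeated so the snippet runs on its own; the escaped closing bracket and the backtick inside the class are an assumption about the intended pattern.

```python
import re

# Separator class repeated from the answer (assumed intended form).
_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

title = u"Winter's Tale"

# Every run of separator characters becomes a split point.
words = _punct_re.split(title.lower())
print(words)             # ['winter', 's', 'tale']

# Empty pieces are dropped before joining with the delimiter.
joined = '-'.join(w for w in words if w)
print(joined)            # winter-s-tale
```

Note that the apostrophe is also a separator, so "Winter's" ends up as two words.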

Example:

print(normalize_filename(u"Esto, no es válido como nombre de Archivo!", "-"))
print(normalize_filename("Tampoco?funionaría esto? eh!!!!", "-"))

Output:

esto-no-es-valido-como-nombre-de-archivo
tampoco-funionaria-esto-eh
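
The [u"Winter's Tale"] in the question appears because extract() returns a list, and str() of a list includes the brackets and quotes. The following is a rough sketch of taking the first element before creating the folder. The list literal stands in for the spider's result, and the "untitled" fallback and the temporary base directory are assumptions made only for illustration. In a spider, Scrapy's extract_first() on the selector result returns the first match directly (or None if there is none).

```python
import os
import tempfile

# Stand-in for response.css('div.info h1::text').extract(), which returns a list.
titles = [u"Winter's Tale"]

# Take the first match instead of calling str() on the whole list;
# fall back to a placeholder name (hypothetical) when nothing matched.
title = titles[0] if titles else u"untitled"

# Create the folder inside a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, title)
    os.makedirs(folder, exist_ok=True)
    created = os.path.isdir(folder)

print(title)    # Winter's Tale
print(created)  # True
```

If the title may contain characters the file system rejects, it can be passed through normalize_filename first and the result used as the folder name.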