Format the result of response.css in a Scrapy spider

When I run my spider, I store the values I scrape in a dictionary, but I also want to create a folder named after one of the results.

def parse(self, response):

    ml_item = ScrapyItem()
    mt_item = ScrapyItem()

    mt_item['title'] = response.css('div.info h1::text').extract()
    mt_item['Parodies'] = response.css('span.characters 

    name = str(mt_item['title'])
    os.mkdir(name)

The problem is that the name comes out as [u"Winter's Tale"].

How can I format it so that only the words remain?

    
asked by valentin rodriguez 23.06.2017 в 08:37

1 answer

Here is a routine I use in these cases. It works with Unicode strings and strips several characters that file systems do not accept.

import re

from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def normalize_filename(text, delim='-'):
    """Normalize a string so it can be used as a file name.

    Args:
        text (str): String to normalize
        delim (str): Replacement character for the invalid ones

    Example:
        >>> normalize_filename(u"Esto, no es válido como nombre de Archivo!", "-")
        'esto-no-es-valido-como-nombre-de-archivo'
    """
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        word = word.decode('utf-8')
        if word:
            result.append(word)
    return delim.join(result)

_punct_re defines the regular expression for the characters we want to clean out. They are treated as separators: the text is split on them, and the remaining words are joined with the value of the delim parameter.

Example:

print(normalize_filename(u"Esto, no es válido como nombre de Archivo!", "-"))
print(normalize_filename("Tampoco?funionaría esto? eh!!!!", "-"))

Output:

esto-no-es-valido-como-nombre-de-archivo
tampoco-funionaria-esto-eh
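
To tie this back to the question: extract() returns a list, so str() of it gives the list's text form, brackets included. Below is a minimal sketch of taking the first element and normalizing it before creating the folder. The list literal stands in for the spider's response.css(...).extract() result, folder_name_from is a hypothetical helper, and the temporary base directory is only there so the example can run anywhere. In Scrapy itself, extract_first() should give you the first match directly instead of a list.

```python
import os
import re
import tempfile
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')


def normalize_filename(text, delim='-'):
    """Same routine as above, repeated so this sketch runs on its own."""
    words = []
    for word in _punct_re.split(text.lower()):
        # Decompose accented characters and drop whatever is not ASCII.
        word = normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            words.append(word)
    return delim.join(words)


def folder_name_from(extracted, default='untitled'):
    """Hypothetical helper: first extracted string -> safe folder name."""
    title = extracted[0] if extracted else ''
    # Fall back to a default if nothing usable is left after cleaning.
    return normalize_filename(title) or default


# Stand-in for response.css('div.info h1::text').extract()
extracted = [u"Winter's Tale"]
name = folder_name_from(extracted)

# Assumption: a temporary base directory, so the sketch does not touch
# your project tree. In the spider you would pick your own base path.
base_dir = tempfile.mkdtemp()
target = os.path.join(base_dir, name)
os.makedirs(target)

print(name)
```

Note that the apostrophe is one of the separators in _punct_re, so "Winter's Tale" becomes winter-s-tale. If you would rather keep it as winters-tale, remove \' from the character class and strip apostrophes separately.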
    
answered by 23.06.2017 / 21:41