Analyze HTML in Python with BeautifulSoup

Objective: I'm trying to list specific product names from the page https://www.screwfix.com/c/tools/angle-grinders/cat830694

For example:

  • Get Titan TTB281GRD from the title attribute of the link in this part:

    <div id="product_box_14" class="lg-12 md-24 sm-24 cols">
      <div id="productID_93905" class="lii lii--j2 lii__offer">
    
        <div class="lii_head">
          <h3 class="lii__title">
            <a id="product_description_14" href="https://www.screwfix.com/p/titan-ttb281grd-750w-4-angle-grinder-230-240v/93905" descriptionproductid="93905" title='Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V'>
              Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V
            </a>
    
            <span id="product_quoteNo_14" quotenumberproductid="93905">
              (93905)
            </span>
          </h3>
        </div>
      </div>
    </div>
    
  • and get Makita DGA456Z from this analogous part:

    <div id="product_box_1" class="lg-12 md-24 sm-24 cols">
      <div id="productID_2906R" class="lii lii--j2 lii__offer">
    
        <div class="lii_head">
          <h3 class="lii__title">
            <a id="product_description_1" href="https://www.screwfix.com/p/makita-dga456z-18v-li-ion-4-brushless-cordless-angle-grinder-bare/2906r" descriptionproductid="2906R" title='Makita DGA456Z 18V Li-Ion  4½&#034; Brushless Cordless Angle Grinder - Bare'>
              Makita DGA456Z 18V Li-Ion  4½&#034; Brushless Cordless Angle Grinder - Bare
            </a>
    
            <span id="product_quoteNo_1" quotenumberproductid="2906R">
              (2906R)
            </span>
          </h3>
        </div>
      </div>
    </div>
    
  • Description: The values should end up in the variable "titulo" (inside class = "lii_head", then class = "lii__title", and finally the "title" attribute of the link).

    Code: My program downloads the HTML correctly and filters the parts I want, but when I try to get the "title" it returns an empty list.

    # -*- coding: utf-8 -*-
    
    from bs4 import BeautifulSoup
    import requests
    
    URL = "https://www.screwfix.com/c/tools/angle-grinders/cat830694"
    
    # Make the request to the website
    req = requests.get(URL)
    
    # Check that the request returns Status Code = 200
    status_code = req.status_code
    if status_code == 200:
    
        # Pass the page's HTML content to a BeautifulSoup() object
        html = BeautifulSoup(req.text, "html.parser")
        #print html
    
        # Get all the elements where the entries are
        entradas = html.find_all('h3', {'class': 'lii__title'})
        #print entradas
    
        # Loop over all the entries to extract the title, author and date
        for i, entrada in enumerate(entradas):
    
            print(entrada)
            # With the "getText()" method the HTML is not returned
            titulo = entrada.find_all('a', {'title'})
    
            # Print the title, author and date of the entries
            print(i + 1, titulo)
    
    else:
        print("Status Code %d" % status_code)
    
        
    asked by Pijolysss 13.12.2017 at 00:10

    1 answer

    You were very close. To get an attribute of an element (in this case title in <a title="...">), you can treat the element as a dictionary:

    enlace = entrada.find('a')
    titulo = enlace['title']
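
As a minimal sketch of the dictionary-style access, the snippet below parses a trimmed copy of the markup from the question instead of fetching the live page. The `sample_html` string is an illustrative stand-in, not the full page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (stand-in for the live page)
sample_html = """
<h3 class="lii__title">
  <a id="product_description_14" title='Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V'>
    Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V
  </a>
</h3>
"""

html = BeautifulSoup(sample_html, "html.parser")
entrada = html.find("h3", {"class": "lii__title"})

enlace = entrada.find("a")   # first <a> inside the <h3>
titulo = enlace["title"]     # attributes are read like dictionary keys
print(titulo)
```

The parser also decodes the `&#034;` entity, so the value contains a plain double quote.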
    

    However, it is safer to use .get(), which avoids an error when an element does not have that attribute.

    titulo = enlace.get('title')
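
To illustrate the difference, here is a small sketch with two made-up links, only one of which has a `title` attribute. The markup and the names `titles`, `fallback` and `raised` are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Two hypothetical links: only the first one carries a "title" attribute
sample_html = '<a href="/p/1" title="With title">one</a><a href="/p/2">two</a>'
enlaces = BeautifulSoup(sample_html, "html.parser").find_all("a")

# .get() returns None when the attribute is missing
titles = [enlace.get("title") for enlace in enlaces]

# .get() also accepts an optional fallback value
fallback = enlaces[1].get("title", "(no title)")

# Dictionary-style access raises KeyError for a missing attribute
try:
    enlaces[1]["title"]
    raised = False
except KeyError:
    raised = True

print(titles, fallback, raised)
```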
    


    There is also a simpler way to reach the elements you are interested in: CSS selectors. This is arguably the simplest way to scrape with BeautifulSoup.

    In your case:

      

    DIV class = "lii_head"
      [with a direct child] H3 class = "lii__title"
      [with a direct child] A

    The .select() method accepts a CSS selector. This saves us from chaining searches.

    html.select('div.lii_head > h3.lii__title > a')
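
A quick way to see what the `>` (direct child) combinator does is to run the selector on a small hand-written document. The markup below is an assumption for illustration: it has one link inside `div.lii_head` and one inside a different div, which the selector should skip.

```python
from bs4 import BeautifulSoup

# Hand-written sample: only the first <a> sits under div.lii_head > h3.lii__title
sample_html = """
<div class="lii_head">
  <h3 class="lii__title">
    <a href="/p/93905" title="Titan TTB281GRD 750W Angle Grinder">Titan TTB281GRD</a>
    <span>(93905)</span>
  </h3>
</div>
<div class="other">
  <h3 class="lii__title"><a href="/x" title="Not inside div.lii_head">skip me</a></h3>
</div>
"""

html = BeautifulSoup(sample_html, "html.parser")

# select() returns a list with every element that matches the CSS selector
entradas = html.select("div.lii_head > h3.lii__title > a")
titulos = [entrada.get("title") for entrada in entradas]
print(titulos)
```

When only the first match is needed, `select_one()` takes the same selector and returns a single element, or `None` if nothing matches.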
    


    Code:

    from bs4 import BeautifulSoup
    import requests
    
    URL = "https://www.screwfix.com/c/tools/angle-grinders/cat830694"
    
    # Web request
    req = requests.get(URL)
    
    # Check that the request returns Status Code = 200
    status_code = req.status_code
    if status_code == 200:
    
        # Build the DOM with BeautifulSoup
        html = BeautifulSoup(req.text, "html.parser")
    
        # CSS selector
        entradas = html.select('div.lii_head > h3.lii__title > a')
        for i, entrada in enumerate(entradas):
            # Get the "title" attribute
            titulo = entrada.get('title')
            print("%d == %s" % (i, titulo.encode('utf-8')))
    
    else:
        print ("Status Code %d" % status_code)
    


    Result:

    0 == b'Makita DGA456Z 18V Li-Ion  4\xc2\xbd" Brushless Cordless Angle Grinder - Bare'
    1 == b'DeWalt DCG412N 18V Li-Ion XR 5"  Angle Grinder - Bare'
    2 == b'Bosch GWS 710 700W 4\xc2\xbd"  Angle Grinder 230V'
    3 == b'DeWalt DWE4206-GB 1010W 4\xc2\xbd"  Angle Grinder 240V'
    

    ... etc. (20 lines in total)
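
The `b'...'` prefix and the `\xc2\xbd` escapes in the result come from `.encode('utf-8')`: under Python 3 that call turns the title into a bytes object, and `\xc2\xbd` is the UTF-8 encoding of `½`. A small sketch, assuming Python 3 and a hard-coded title in place of the scraped one:

```python
# Hard-coded stand-in for a scraped title
titulo = 'Bosch GWS 710 700W 4½"  Angle Grinder 230V'

as_bytes = titulo.encode("utf-8")    # bytes object, as printed in the result above
as_text = as_bytes.decode("utf-8")   # decoding gives back the original str

line_bytes = "%d == %s" % (2, as_bytes)   # %s on bytes shows the b'...' form
line_text = "%d == %s" % (2, titulo)      # %s on str keeps the ½ readable

print(line_bytes)
print(line_text)
```

If readable text is the goal and the terminal handles UTF-8, printing `titulo` directly without encoding is usually enough.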


        
    answered by 15.12.2017 / 20:00