Analyze HTML in Python with BeautifulSoup

Question

Objective: I'm trying to list the product names shown on the page https://www.screwfix.com/c/tools/angle-grinders/cat830694 .

For example:

Get Titan TTB281GRD from the title attribute of the link in this fragment:

<div id="product_box_14" class="lg-12 md-24 sm-24 cols">
  <div id="productID_93905" class="lii lii--j2 lii__offer">

    <div class="lii_head">
      <h3 class="lii__title">
        <a id="product_description_14" href="https://www.screwfix.com/p/titan-ttb281grd-750w-4-angle-grinder-230-240v/93905" descriptionproductid="93905" title='Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V'>
          Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V
        </a>

        <span id="product_quoteNo_14" quotenumberproductid="93905">
          (93905)
        </span>
      </h3>
    </div>
  </div>
</div>

and get Makita DGA456Z from this analogous fragment:

<div id="product_box_1" class="lg-12 md-24 sm-24 cols">
  <div id="productID_2906R" class="lii lii--j2 lii__offer">

    <div class="lii_head">
      <h3 class="lii__title">
        <a id="product_description_1" href="https://www.screwfix.com/p/makita-dga456z-18v-li-ion-4-brushless-cordless-angle-grinder-bare/2906r" descriptionproductid="2906R" title='Makita DGA456Z 18V Li-Ion  4½&#034; Brushless Cordless Angle Grinder - Bare'>
          Makita DGA456Z 18V Li-Ion  4½&#034; Brushless Cordless Angle Grinder - Bare
        </a>

        <span id="product_quoteNo_1" quotenumberproductid="2906R">
          (2906R)
        </span>
      </h3>
    </div>
  </div>
</div>

Description: The values should end up in the variable "titulo" (go into class="lii_head", then class="lii__title", and then read the title attribute of the link).

Code: My program downloads the HTML correctly and I can filter down to the parts I want, but when I try to get the "title" it returns an empty list.

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests

URL = "https://www.screwfix.com/c/tools/angle-grinders/cat830694"

# Make the request to the website
req = requests.get(URL)

# Check that the request returns status code 200
status_code = req.status_code
if status_code == 200:

    # Pass the page's HTML content to a BeautifulSoup object
    html = BeautifulSoup(req.text, "html.parser")
    #print(html)

    # Get all the h3 elements that hold the entries
    entradas = html.find_all('h3', {'class': 'lii__title'})
    #print(entradas)

    # Loop over all the entries to extract the title
    for i, entrada in enumerate(entradas):

        print(entrada)
        # This is the call that returns an empty list
        titulo = entrada.find_all('a', {'title'})

        # Print the entry number and the title
        print(i + 1, titulo)

else:
    print("Status Code %d" % status_code)

html python beautifulsoup

asked by Pijolysss 12.12.2017 at 23:10

1 answer

score 1 · Accepted Answer

You were on the right track. To get an attribute of an element (in this case title in <a title="...">), you can treat the element as a dictionary:

enlace = entrada.find('a')
titulo = enlace['title']

However, it is safer to use .get(), which returns None instead of raising a KeyError when an element does not have that attribute.

titulo = enlace.get('title')
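
To make the difference between the two access styles concrete, here is a minimal sketch. The HTML fragment and the ids with_title and without_title are made up for illustration and do not come from the Screwfix page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first link has a title attribute, the second does not.
SAMPLE = """
<h3 class="lii__title">
  <a id="with_title" href="#" title="Titan TTB281GRD 750W Angle Grinder">Titan</a>
  <a id="without_title" href="#">No title here</a>
</h3>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
with_title = soup.find("a", id="with_title")
without_title = soup.find("a", id="without_title")

# Both access styles return the same value when the attribute exists.
same_value = with_title["title"] == with_title.get("title")

# .get() returns None (or the default you pass) when the attribute is missing.
missing = without_title.get("title")
fallback = without_title.get("title", "")

# Dictionary-style access raises KeyError for the missing attribute.
try:
    without_title["title"]
    raised = False
except KeyError:
    raised = True

print(same_value, missing, repr(fallback), raised)
```

Passing a default to .get() is handy when the value is fed straight into string formatting, since None would otherwise need a separate check.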

There is also a simpler way to reach the elements you are interested in: CSS selectors. This is often the simplest way to scrape with BeautifulSoup.

In your case:

DIV class = "lii_head"
  [with a direct child] H3 class = "lii__title"
  [with a direct child] A

The .select() method accepts a CSS selector. This saves us from chaining searches.

html.select('div.lii_head > h3.lii__title > a')
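
As a rough offline illustration, the sketch below runs that selector against a trimmed copy of the fragment from the question (some attributes are dropped for brevity) and compares it with the equivalent chained searches. The variable names via_select and via_find are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first fragment from the question
SAMPLE = """
<div id="product_box_14" class="lg-12 md-24 sm-24 cols">
  <div id="productID_93905" class="lii lii--j2 lii__offer">
    <div class="lii_head">
      <h3 class="lii__title">
        <a id="product_description_14" title='Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V'>
          Titan TTB281GRD 750W 4½&#034;  Angle Grinder 230-240V
        </a>
        <span id="product_quoteNo_14">(93905)</span>
      </h3>
    </div>
  </div>
</div>
"""

html = BeautifulSoup(SAMPLE, "html.parser")

# One CSS selector: ">" means "direct child of"
via_select = html.select('div.lii_head > h3.lii__title > a')

# The same search written as chained find_all() calls;
# recursive=False restricts each step to direct children
via_find = []
for head in html.find_all('div', class_='lii_head'):
    for h3 in head.find_all('h3', class_='lii__title', recursive=False):
        via_find.extend(h3.find_all('a', recursive=False))

titles = [a.get('title') for a in via_select]
print(titles)
```

Both approaches find the same single link here; the selector version simply states the whole path in one string.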

Code:

from bs4 import BeautifulSoup
import requests

URL = "https://www.screwfix.com/c/tools/angle-grinders/cat830694"

# Web request
req = requests.get(URL)

# Check that the request returns status code 200
status_code = req.status_code
if status_code == 200:

    # Build the DOM with BeautifulSoup
    html = BeautifulSoup(req.text, "html.parser")

    # CSS selector
    entradas = html.select('div.lii_head > h3.lii__title > a')
    for i, entrada in enumerate(entradas):
        # Get the "title" attribute
        titulo = entrada.get('title')
        print("%d == %s" % (i, titulo.encode('utf-8')))

else:
    print("Status Code %d" % status_code)

Result:

0 == b'Makita DGA456Z 18V Li-Ion  4\xc2\xbd" Brushless Cordless Angle Grinder - Bare'
1 == b'DeWalt DCG412N 18V Li-Ion XR 5"  Angle Grinder - Bare'
2 == b'Bosch GWS 710 700W 4\xc2\xbd"  Angle Grinder 230V'
3 == b'DeWalt DWE4206-GB 1010W 4\xc2\xbd"  Angle Grinder 240V'

... etc. (20 lines in total)
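
A note on reading that output: the b'...' wrapper and the \xc2\xbd sequences are what Python 3 prints for the bytes object produced by .encode('utf-8'). The small sketch below, assuming Python 3, decodes the first result line back to text to show that \xc2\xbd is just the "½" character.

```python
# First line of the result above, as the bytes object that was printed
encoded = b'Makita DGA456Z 18V Li-Ion  4\xc2\xbd" Brushless Cordless Angle Grinder - Bare'

# Decoding recovers the original text; \xc2\xbd is the UTF-8 encoding of "½"
titulo = encoded.decode('utf-8')
print("%d == %s" % (0, titulo))
```

Under Python 3, printing titulo directly without the .encode('utf-8') call gives this readable form.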
