An Introduction to BeautifulSoup

December 19, 2014

Few things are less fun than parsing text, even when that text is supposed to be formatted according to certain rules (like HTML). We know the web is full of badly written markup, so the effort required to reliably extract data from it is daunting.

Save yourself a few months of work, and just use BeautifulSoup.

For a simple real-world example of its power, let’s say we have a GUI application that should display a list of links, with icons and titles, from the HTML source of any arbitrary page you give it.

First, some setup:

from os import path
from bs4 import BeautifulSoup


# a place to store the links we find
links = []

For this example, we’ll assume you’ve gotten the HTML source from https://www.python.org/ and stuffed it in a variable called page somehow. We start by turning it into beautiful soup.

# name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(page, 'html.parser')
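
If you want to experiment without fetching anything, a small inline document can stand in for page. This is only a sketch: sample_page and its contents are made up for illustration, and 'html.parser' (the parser that ships with Python) is passed explicitly so the result does not depend on which optional parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page source (an assumption for this sketch).
sample_page = """
<html><body>
  <a href="/about">About</a>
  <a href="/downloads"><img src="/static/dl.png" alt="Downloads"></a>
  <a name="top">no href here</a>
</body></html>
"""

# Naming the parser keeps the behavior the same from machine to machine.
sample_soup = BeautifulSoup(sample_page, 'html.parser')

# The first <a> tag and its attributes are now easy to reach.
first_link = sample_soup.find('a')
first_href = first_link['href']   # attribute access works like a dictionary
first_text = first_link.string    # the text inside the tag
```

Attributes are read with dictionary-style indexing, and .string gives the text when the tag holds nothing else.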

With that, we can very easily iterate all the links on the page. In this case, I’ll define “link” to be any <a> tag that has an href attribute set.

for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
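
To see what that filter keeps and drops, here is a minimal sketch run against a handful of made-up anchors (the HTML and the names candidates and useful are assumptions for illustration). Passing href=True restricts the search to tags that carry the attribute, and the condition then discards empty and fragment-only targets.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering each case the loop has to handle.
html = """
<a href="/about">About</a>
<a href="">empty</a>
<a href="#content">skip to content</a>
<a name="anchor">no href</a>
<a href="https://docs.python.org/">Docs</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# href=True selects only <a> tags that have an href attribute.
candidates = soup.find_all('a', href=True)

# Drop empty hrefs and same-page fragments, as in the loop above.
useful = [a['href'] for a in candidates
          if a['href'] and not a['href'].startswith('#')]
```

Only the two links that actually lead somewhere survive.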

Our results will be a list of tuples with three pieces of information about each link: the URL, the title, and an image. The last two are optional, and might not be present, so we start out with a (mutable) dictionary.

thisLink = {
    'url': link['href'],
    'title': link.string,
    'image': '',
}
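
It helps to know when link.string comes back as None, since the rest of the loop depends on it. The sketch below uses made-up markup (an assumption for illustration) to show the three cases: plain text, text mixed with other tags, and a link that only wraps an image.

```python
from bs4 import BeautifulSoup

# Hypothetical links: text only, text plus markup, and image only.
html = (
    '<a href="/a">Plain text</a>'
    '<a href="/b">Text with <b>markup</b></a>'
    '<a href="/c"><img src="/static/c.png"></a>'
)
soup = BeautifulSoup(html, 'html.parser')

# .string is only set when the tag boils down to a single piece of text.
titles = {a['href']: a.string for a in soup.find_all('a')}
```

Here only '/a' gets a title from .string; the other two are None, which is why the following steps look for a fallback.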

If the <a> tag surrounds an image, we want to use it as an icon for the link.

img = link.find('img', src=True)
if img:
    thisLink['image'] = img['src']

Further, if the image has a title or alt attribute, use that as the link’s title. If not, fall back to using the file name.

    if thisLink['title'] is None:
        # look for a title here if none exists
        # (has_attr checks attributes; `in` on a tag checks its children)
        if img.has_attr('title'):
            thisLink['title'] = img['title']
        elif img.has_attr('alt'):
            thisLink['title'] = img['alt']
        else:
            thisLink['title'] = path.basename(img['src'])
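
The same fallback chain can be written more compactly with Tag.get(), which returns None for a missing attribute much like dict.get(). One thing to watch in bs4: the in operator applied to a tag tests its children, not its attributes, so attribute checks should go through get() or has_attr(). The helper name image_title and the sample markup are assumptions for this sketch, and note that using or also skips attributes that are present but empty.

```python
from os import path
from bs4 import BeautifulSoup


def image_title(img):
    """Pick a title for an <img> tag: title, then alt, then the file name."""
    # Each get() yields None when the attribute is absent, so `or` falls through.
    return img.get('title') or img.get('alt') or path.basename(img['src'])


# Hypothetical images exercising each branch of the fallback.
html = (
    '<img src="/static/a.png" title="Logo" alt="ignored">'
    '<img src="/static/b.png" alt="Banner">'
    '<img src="/static/c.png">'
)
images = BeautifulSoup(html, 'html.parser').find_all('img')
image_titles = [image_title(img) for img in images]
```

The three images end up titled 'Logo', 'Banner', and 'c.png' respectively.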

If there’s no title (meaning it wasn’t an image, and link.string was empty), try to come up with one. If we can’t, we’ll skip this link.

if thisLink['title'] is None:
    # check for text inside the link
    if len(link.contents):
        thisLink['title'] = ' '.join(link.stripped_strings)
if not thisLink['title']:
    # if there's *still* no title (empty tag or no text), skip it
    continue
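
As a quick illustration of what stripped_strings contributes, the sketch below joins the text of a link whose content is split across nested tags and padded with whitespace. The markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical links: one with messy nested text, one completely empty.
html = (
    '<a href="/psf">\n  Python <b> Software </b>\n  <i>Foundation</i>\n</a>'
    '<a href="/empty"></a>'
)
psf_link, empty_link = BeautifulSoup(html, 'html.parser').find_all('a')

# stripped_strings yields each text fragment with surrounding whitespace
# removed, and leaves out fragments that are whitespace only.
psf_title = ' '.join(psf_link.stripped_strings)
empty_title = ' '.join(empty_link.stripped_strings)
```

The first link has no usable .string because it has several children, yet the joined result is a clean 'Python Software Foundation'. The empty link produces an empty string, so it is worth treating '' the same as None when deciding whether to skip a link.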

Now, convert what we have to a simpler, immutable tuple and add it to the list.

hashableLink = (thisLink['url'].strip(),
                thisLink['title'].strip(),
                thisLink['image'].strip())
if hashableLink not in links:
    links.append(hashableLink)
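
Because the tuples are hashable, a set can keep track of what has already been stored. Checking membership in a list scans every element, which is fine for one page but slows down on link-heavy ones. This is just one possible variation, not a required change; the helper add_unique and the sample data are assumptions for the sketch.

```python
def add_unique(link, links, seen):
    """Append link to links unless an identical tuple was stored before."""
    if link in seen:
        return False
    seen.add(link)       # the set gives fast duplicate checks
    links.append(link)   # the list preserves the order links were found in
    return True


links = []
seen = set()
for candidate in [
    ('/about', 'About', ''),
    ('/downloads', 'Downloads', '/static/dl.png'),
    ('/about', 'About', ''),  # duplicate, ignored
]:
    add_unique(candidate, links, seen)
```

The list still holds the links in page order, with the repeated '/about' entry stored only once.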

It’s that easy. For more details about the tricks used above, take a look at the official documentation.

A Full Working Example

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
output tab separated lines with the following fields:
    0 url
    1 text
    2 imageurl
"""

from os import path
from sys import stdout

from bs4 import BeautifulSoup
import requests

# a place to store the links we find
links = []

r = requests.get('https://www.python.org/')
page = r.text
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
    # initialize the link
    thisLink = {
        'url': link['href'],
        'title': link.string,
        'image': '',
    }
    # see if the link contains an image
    img = link.find('img', src=True)
    if img:
        thisLink['image'] = img['src']
        if thisLink['title'] is None:
            # look for a title here if none exists
            if img.has_attr('title'):
                thisLink['title'] = img['title']
            elif img.has_attr('alt'):
                thisLink['title'] = img['alt']
            else:
                thisLink['title'] = path.basename(img['src'])

    if thisLink['title'] is None:
        # check for text inside the link
        if len(link.contents):
            thisLink['title'] = ' '.join(link.stripped_strings)
    if not thisLink['title']:
        # if there's *still* no title (empty tag or no text), skip it
        continue
    # convert to something immutable for storage
    hashableLink = (thisLink['url'].strip(),
                    thisLink['title'].strip(),
                    thisLink['image'].strip())
    # store the result
    if hashableLink not in links:
        links.append(hashableLink)

# print the results
for link in links:
    stdout.write('\t'.join(link) + '\n')

 

