Few things are less fun than parsing text, even when that text is supposed to be formatted according to certain rules (like HTML). We know the web is full of badly written markup, so the effort required to reliably extract data from it is daunting.
Save yourself a few months of work, and just use BeautifulSoup.
For a simple real-world example of its power, let’s say we have a GUI application that should display a list of links, with icons and titles, from the HTML source of any arbitrary page you give it.
First, some setup:
from os import path
from bs4 import BeautifulSoup

# a place to store the links we find
links = []
For this example, we’ll assume you’ve gotten the HTML source from https://www.python.org/ and stuffed it in a variable called page somehow. We start by turning it into beautiful soup.
# name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(page, 'html.parser')
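To get a feel for how forgiving the parser is, here is a minimal sketch that feeds it a deliberately sloppy fragment. The broken string is made up for this illustration, and I'm assuming Python's built-in html.parser is good enough for the job (lxml or html5lib would also work if installed).

```python
from bs4 import BeautifulSoup

# Hypothetical sloppy markup: <p>, <b> and <a> are never closed.
broken = '<p>Welcome to <b>Python<p><a href="/about/">About'

# 'html.parser' ships with Python, so no extra dependency is needed.
soup = BeautifulSoup(broken, 'html.parser')

# The link is still reachable even though the markup is invalid.
first_link = soup.find('a')
link_href = first_link['href']
link_text = first_link.get_text()
print(link_href, link_text)
```

Nothing had to be cleaned up by hand before asking for the link.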
With that, we can very easily iterate over all the links on the page. In this case, I’ll define “link” to be any <a> tag that has an href attribute set.
for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
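If you want to see what that filter keeps and what it throws away, the following sketch runs it against a handful of made-up anchors (the sample string is an assumption, not part of python.org). It uses find_all, the current spelling of findAll in bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical anchors covering each case the filter cares about.
sample = '''
<a href="/downloads/">Downloads</a>
<a name="top">Anchor without href</a>
<a href="">Empty</a>
<a href="#content">Skip to content</a>
<a href="https://docs.python.org/">Docs</a>
'''

soup = BeautifulSoup(sample, 'html.parser')

kept = []
# href=True matches only <a> tags that carry an href attribute at all.
for link in soup.find_all('a', href=True):
    # Empty hrefs and same-page fragments are not useful as links.
    if link['href'] == '' or link['href'].startswith('#'):
        continue
    kept.append(link['href'])

print(kept)
```

Only the Downloads and Docs links should survive; the named anchor never matches in the first place.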
Our results will be a list of tuples with three pieces of information about each link: the URL, the title, and an image. The last two are optional, and might not be present, so we start out with a (mutable) dictionary.
    thisLink = {
        'url': link['href'],
        'title': link.string,
        'image': '',
    }
If the <a> tag surrounds an image, we want to use it as an icon for the link.
    img = link.find('img', src=True)
    if img:
        thisLink['image'] = img['src']
Further, if the image has a title or alt attribute, use that as the link’s title. If not, fall back to using the file name.
        if thisLink['title'] is None:
            # look for a title here if none exists
            # (has_attr checks attributes; `in` would test the tag's children)
            if img.has_attr('title'):
                thisLink['title'] = img['title']
            elif img.has_attr('alt'):
                thisLink['title'] = img['alt']
            else:
                thisLink['title'] = path.basename(img['src'])
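One detail worth checking here: on a bs4 tag, the in operator tests membership in the tag's children, not its attributes, so attribute checks are better done with has_attr() or get(). The sketch below shows the difference on a made-up image link (the sample markup and file path are assumptions).

```python
from os import path
from bs4 import BeautifulSoup

# Hypothetical link wrapping an image that has alt but no title.
sample = '<a href="/psf/"><img src="/static/img/psf-logo.png" alt="PSF"></a>'
img = BeautifulSoup(sample, 'html.parser').find('img', src=True)

# `in` looks through img's children (it has none), so this is False
# even though the alt attribute is present.
in_operator = 'alt' in img

# has_attr() asks about attributes, which is what we actually want.
has_alt = img.has_attr('alt')
has_title = img.has_attr('title')

# get() returns None for a missing attribute, so the fallbacks chain:
# title attribute, then alt, then the image's file name.
title = img.get('title') or img.get('alt') or path.basename(img['src'])
print(in_operator, has_alt, has_title, title)
```

The chained get() form is a more compact alternative to the if/elif/else ladder; it also treats an empty title attribute as missing, which may or may not be what you want.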
If there’s no title (meaning it wasn’t an image, and link.string was empty), try to come up with one. If we can’t, we’ll skip this link.
    if thisLink['title'] is None:
        # check for text inside the link
        if len(link.contents):
            thisLink['title'] = ' '.join(link.stripped_strings)
    if thisLink['title'] is None:
        # if there's *still* no title (empty tag), skip it
        continue
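Why can link.string be empty when the link clearly has text? A tag's .string is None whenever the tag has more than one child, which is common once the text is wrapped in nested tags. Here is a small sketch with a made-up link showing how stripped_strings recovers the text in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical link whose text is spread over nested tags and whitespace.
sample = '<a href="/events/">\n  <span>Python</span>\n  <em> Events </em>\n</a>'
link = BeautifulSoup(sample, 'html.parser').find('a')

# Several children (whitespace, <span>, <em>), so .string gives up.
single = link.string

# stripped_strings walks every nested string, trims it, and skips
# the ones that are only whitespace.
title = ' '.join(link.stripped_strings)

# An empty tag has no contents and yields no strings at all.
empty = BeautifulSoup('<a href="/x/"></a>', 'html.parser').find('a')
empty_strings = list(empty.stripped_strings)
print(single, title, empty_strings)
```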
Now, convert what we have to a simpler, immutable tuple and add it to the list.
    hashableLink = (thisLink['url'].strip(),
                    thisLink['title'].strip(),
                    thisLink['image'].strip())
    if hashableLink not in links:
        links.append(hashableLink)
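Because the tuples are hashable, they can also go into a set. For pages with thousands of links, a membership test against a list gets slow, so one possible variation (not what the script in this article does) keeps a set next to the list. The add_link helper and the sample values below are hypothetical.

```python
links = []    # results, in first-seen order
seen = set()  # the same tuples, for fast membership tests


def add_link(url, title, image=''):
    """Store a (url, title, image) tuple once, keeping first-seen order."""
    entry = (url.strip(), title.strip(), image.strip())
    if entry not in seen:
        seen.add(entry)
        links.append(entry)


add_link('/downloads/', 'Downloads')
add_link(' /downloads/ ', 'Downloads ')  # same link once whitespace is stripped
add_link('/psf/', 'PSF', '/static/img/psf-logo.png')
print(links)
```

For a single page the plain list is perfectly fine; the set only starts to matter as the number of links grows.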
It’s that easy. For more details about the tricks used above, take a look at the official documentation. The complete script, which fetches the page and prints each link as a tab-separated line, follows.
#!/usr/bin/env python3
"""
output tab separated lines with the following fields:
0 url
1 text
2 imageurl
"""
from os import path

from bs4 import BeautifulSoup
import requests

# a place to store the links we find
links = []

# fetch the page; fail loudly on HTTP errors instead of parsing an error page
r = requests.get('https://www.python.org/', timeout=10)
r.raise_for_status()
page = r.text
soup = BeautifulSoup(page, 'html.parser')

for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
    # initialize the link
    thisLink = {
        'url': link['href'],
        'title': link.string,
        'image': '',
    }
    # see if the link contains an image
    img = link.find('img', src=True)
    if img:
        thisLink['image'] = img['src']
        if thisLink['title'] is None:
            # look for a title here if none exists
            # (has_attr checks attributes; `in` would test the tag's children)
            if img.has_attr('title'):
                thisLink['title'] = img['title']
            elif img.has_attr('alt'):
                thisLink['title'] = img['alt']
            else:
                thisLink['title'] = path.basename(img['src'])
    if thisLink['title'] is None:
        # check for text inside the link
        if len(link.contents):
            thisLink['title'] = ' '.join(link.stripped_strings)
    if thisLink['title'] is None:
        # if there's *still* no title (empty tag), skip it
        continue
    # convert to something immutable for storage
    hashableLink = (thisLink['url'].strip(),
                    thisLink['title'].strip(),
                    thisLink['image'].strip())
    # store the result
    if hashableLink not in links:
        links.append(hashableLink)

# print the results (Python 3's stdout already handles text encoding)
for link in links:
    print('\t'.join(link))