Python and HTML Processing

Bookmark History

Saved by 25 people (-4 private), first by anonymouse user on 2006-07-27


Public Sticky notes

Fetching standard Web pages over HTTP is very easy with Python:

import urllib
# Get a file-like object for the Python Web site's home page.
f = urllib.urlopen("http://www.python.org")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Highlighted by imouthesmp
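The snippet above uses the Python 2 urllib module. As a minimal sketch of the same fetch on Python 3, where urlopen lives in urllib.request, something like the following should work. The helper name fetch_page and the timeout value are assumptions made for illustration, not part of the original article.

```python
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Return the raw bytes of the resource at `url`.

    `fetch_page` and the default timeout are illustrative assumptions.
    """
    # The with-statement closes the response even if read() raises,
    # which replaces the explicit f.close() call.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_page("http://www.python.org") would return the page as bytes; decode it (for example with .decode("utf-8")) before treating it as text.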

Supplying Data

Sometimes it is necessary to pass information to the Web server, such as the values that an HTML form would submit. You need to know which fields the form provides; assuming that you already know this, you can supply the data in the urlopen function call:

# Search the Vaults of Parnassus for "XMLForms".
# First, encode the data.
data = urllib.urlencode({"find" : "XMLForms", "findtype" : "t"})
# Now get that file-like object again, remembering to mention the data.
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py", data)
# Read the results back.
s = f.read()
f.close()

Highlighted by imouthesmp
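On Python 3 the same idea is split across urllib.parse and urllib.request, and the encoded form data must be bytes; passing data to urlopen makes the request a POST. The following is a rough sketch under those assumptions. The helper names encode_form and post_form are hypothetical and do not come from the original article.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def encode_form(fields):
    """Encode a dict of form fields as a URL-encoded bytes body."""
    # urlencode returns str; urlopen requires the body as bytes.
    return urlencode(fields).encode("ascii")


def post_form(url, fields, timeout=10):
    """POST the form fields to `url` and return the response bytes.

    Hypothetical helper: supplying `data` to urlopen switches the
    request method from GET to POST.
    """
    with urlopen(url, encode_form(fields), timeout=timeout) as response:
        return response.read()
```

For the search above, the call would be post_form("http://www.vex.net/parnassus/apyllo.py", {"find": "XMLForms", "findtype": "t"}), assuming that address still accepts such requests.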

Various Web surfing tasks that I regularly perform could be made much easier, and less tedious, if I could only use Python to fetch the HTML pages and to process them, yielding the information I really need. In this document I attempt to describe HTML processing in Python using readily available tools and libraries.

Highlighted by reckoner

Defining a Parser Class

First of all, let us define a new class inheriting from SGMLParser, with a convenience method that I find very useful indeed:

import sgmllib

class M

Highlighted by gialloporpora
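The highlighted class definition is cut off, so the author's code is not reproduced here. As a stand-in, here is a minimal sketch of the general pattern (a parser subclass with a convenience method) using html.parser.HTMLParser from the Python 3 standard library, since sgmllib is not available on Python 3. The class name LinkCollector, the parse method, and the choice to collect hyperlinks are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical parser that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def parse(self, s):
        """Convenience method: feed the whole document, then close."""
        self.feed(s)
        self.close()
        return self.hyperlinks

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hyperlinks.append(value)
```

With this sketch, LinkCollector().parse(text) takes the decoded text of a fetched page and returns the list of link targets found in it.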