Screen Scraping HTML

14 minute read Published: 2005-04-06

We've all found useful information on the web. Occasionally, it's even necessary to retrieve that information in an automated fashion. It could be just for your own amusement, a new web service that hasn't yet published an API, or even a critical business partner who only exposes a web-based interface to you.

Of course, screen scraping web pages is rarely the optimal solution to any problem, and I highly advise you to look first for an API or formal web service that provides a more consistent and intentional programming interface. Problems can arise for a number of reasons.

Step 0 : Considerations

The most obvious and annoying problem is that you are not guaranteed any consistency in the presentation of your data. Websites are constantly under construction. Even when they look the same, programmers and designers are behind the scenes tweaking little pieces to optimize, straighten, or update. This means your data is likely to move or disappear entirely, which can lead to erroneous results or your program failing to complete.

A problem that you might not think of immediately is the impact of your screen scraping on the target's web server. During the development phase especially, you should give serious thought to mirroring the website using any number of mirroring applications available on the web. This will protect you against accidentally denial-of-servicing the target's web site. Once you move to production, out of common courtesy, you should limit the running of your program to as few times as necessary to provide the accuracy you require. Obviously, if this is a business-to-business transaction, you should keep the other guy in the loop. It won't be good for your business relationship should you trip the other company's Intrusion Detection System and then have to explain what you're doing to a defensive security administrator.
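If you do end up developing against the live site, a little throttling goes a long way. Here's a minimal sketch of the idea using the LWP::Simple module we'll meet in Step 2; the list of URLs and the five second delay are just placeholders, not anything weather.com asks for:

use strict;
use LWP::Simple;

#
# Hypothetical list of pages to fetch; keep it short during development
my @urls = map { "http://www.weather.com/weather/local/$_" } qw(21224 21230);

foreach my $url (@urls) {
    my $content = get($url);
    warn "failed to retrieve $url\n" unless defined $content;

    #
    # ... hand $content off to your parser here ...

    #
    # Pause between requests so we don't hammer their web server
    sleep 5;
}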

Along the same lines, consider the legality of the screen scraping. To a web server, your traffic might masquerade as 100% interactive, valid traffic, but upon closer inspection a wise system administrator will likely put the pieces together. Search the company's website for an "Acceptable Use Policy" and "Terms of Service." In some cases they may not apply, but it's likely that the privilege to access the data is granted only after agreeing to one of those two documents.

Step 1 : Research

At this point, it's necessary to dive into the task at hand. Go through the motions manually in a web browser that supports thorough debugging. My experience with Firefox has always been a positive one. Through the use of tools like the DOM Inspector, the built-in Javascript Debugger, and extensions like Web Developer, View Source With .., and Venkman, it's been one of the best platforms for web development I've encountered. The same elements that go into web design are critical to the automated extraction of that data. There are two phases to debug in order to write a good screen scraper.

The Request

A web server is not a mind reader; it has to know what you're after. HTTP requests tell the web server what document to serve and how to serve it. The request can be issued through the address bar, a form, or a link. As you navigate the site, take note of the parameters passed in the query string of the URL. If you need to log in, use the Web Developer Extension to "Display Form Details" and take note of the names of the login prompts and the form objects themselves. Also, it's important to note the "METHOD" the form is going to use, either "GET" or "POST". As you go through, sketch out the process on a scrap piece of paper, with details on the parameters along the way. If you're clicking on links to get where you need to go, use the right-click option "View Link Properties" to get the details.
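To make that concrete, a "GET" request is nothing more than parameters tacked onto the URL as a query string, and the URI module will happily build one for you. A small sketch; the parameter names are the ones we'll use against weather.com later on, but the action URL here is only a placeholder:

use strict;
use URI;

#
# Build a GET request URL by hand. The parameter names match the
# weather.com form used later in this article; the base URL is a
# placeholder for whatever the form's "action" points at.
my $uri = URI->new('http://www.example.com/search');
$uri->query_form(
    where => '21224',
    what  => 'Weather36HourUndeclared',
);

print $uri->as_string, "\n";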

A key thing people often miss when doing web automation is the effect of client-side scripting. You can use Venkman to step through the entire run of client-side code. Pay attention to hidden form fields that are often set "onClick" of the submit button, or through other types of normal user interaction. Without knowing about these hidden fields and setting them to the correct values, the page may refuse to load or otherwise misbehave. Granted, this isn't good practice on the site designer's part, as a growing number of security-aware web surfers are limiting or disabling client-side scripting entirely.
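Since our scraper won't be running that Javascript, any hidden field the client-side code would have filled in has to be set by hand. Jumping ahead a bit to WWW::Mechanize (covered in Step 2), a rough sketch with made-up form and field names looks like this; note that HTML::Form marks hidden inputs readonly, so they need to be unlocked before you can change them:

use strict;
use WWW::Mechanize;

my $bot = new WWW::Mechanize();
$bot->get('http://www.example.com/login');

#
# 'loginForm' and 'sessionKey' are invented for illustration; hidden
# inputs start out readonly in HTML::Form, so unlock before setting
my $form = $bot->form_name('loginForm') or die "no such form\n";
$form->find_input('sessionKey')->readonly(0);
$bot->field('sessionKey', 'value the onClick handler would have set');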

The Response

After sketching out the path to your data, you've finally arrived at the page that contains the data itself. You now need to map out the page in a way that lets your data be identified amidst the insignificant details, styling, and advertisements! I've always believed in syntax highlighting and have become accustomed to vim's flavor of highlighting. I've got the View Source With .. extension configured to use gvim, so I right click and, with any luck, the page source is displayed in a gvim buffer with syntax highlighting enabled. If the page has a weird extension, or no extension, I might have to ":set syntax=html" when the server isn't sending the proper headers. Search through the source file, correlating the visual representations in the browser with the source code that's generating them. You'll need to find landmarks in the HTML to guide your parser through an obscure landscape of markup. If you're having problems, another indispensable tool provided by Firefox is "View Selection Source". To use it, simply highlight some content and then right click -> "View Selection Source". A Mozilla source viewer opens with just the HTML that generated the selected content, highlighted with some surrounding HTML to provide context.

You're going to have to start thinking like a machine. Think simple: 1's and 0's, true and false! I usually start at my data and work backwards, looking for a unique tag or pattern that I can use to locate the data going forward. Look not only at the HTML elements (<b>, <td>, etc.) but at their attributes (color="#FF0000", colspan="3") to profile the areas containing and surrounding your data.

The lay of the land is changing these days. It should be getting much easier to treat HTML as a data source thanks to Web Standards and the alarming number of web designers pushing whole-heartedly for their adoption. The old table-based layouts, styled by font tags and animated GIFs, are giving way to "Document Object Model" aware design and styling fueled mostly by Cascading Style Sheets (CSS). CSS works most effectively when the document layout emulates an object: there are "classes", "ids", and tags that establish relationships. CSS makes it trivial for Web Designers with passion and experience in the Design Arts to cooperate with Web Programmers whose passion is the Art of Programming and whose idea of "progressive design" is white text on a black background! The cues that Programmers and Designers specify to ensure interoperability of Content and Presentation give the Screen Scraper a legible road map by which to extract their data. If you see "div", "span", "tbody", or "thead" elements bearing attributes like "class" and "id", favor using these elements as landmarks. Though nothing is guaranteed, it's much more likely that these elements will maintain their relationships, as they're often the result of divisional cooperation rather than entropy.
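For example, a fragment like the following (made up, but modeled on the class name our parsing code hunts for later) is a screen scraper's dream, because the "id" and "class" attributes describe the data rather than its presentation:

<!-- a made-up fragment; the class name mirrors the one used in the parsing code below -->
<div id="currentConditions">
  <span class="obsTempLabel">Right Now</span>
  <b class="obsTempTextA">46&deg;F</b>
</div>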

One of the simplest ways to keep your bearings is to print out the section of HTML you're targeting and sketch out some simple logic for quickly identifying it. I use a highlighter and a red pen to make notes on the printout that I can glance at as a sanity check.

Step 2 : Automated Retrieval of Your Content

Depending on how complicated the path to your data is, there are a number of tools available. Basic "GET" requests that don't require cookies, session management, or form tracking can take advantage of the simple interface provided by the LWP::Simple package.

use strict;
use LWP::Simple;

#
# URL of the page we're after
my $url = q|http://www.weather.com/weather/local/21224|;

#
# get() returns undef if the request fails
my $content = get($url);
die "Failed to retrieve $url\n" unless defined $content;

print $content;

That's it. Simple.

More complex problems involving cookies and logins will require a more sophisticated tool. WWW::Mechanize offers a simple solution to a complex path to your data, with the ability to store cookies and construct form objects that can intelligently initialize themselves. An example:

use strict;
use WWW::Mechanize;

my $authPage = q|http://www.weather.com|;
my $authForm = 'whatwhere';
my %formVars = (
    where   => '21224',
    what    => 'Weather36HourUndeclared'
);

#
# or optionally, set the fields in visible order
my @visible = qw(21224);

#
# Create a "bot"
my $bot = new WWW::Mechanize();

#
# Masquerade as Mozilla on a Mac
$bot->agent_alias('Mac Mozilla');

#
# Retrieve the page with our "login form"
$bot->get($authPage);

#
# fill out the form!
$bot->form_name($authForm);

while( my ($k,$v) = each %formVars ) {
    $bot->field($k,$v);
}
#
# OR
# $bot->set_visible(@visible);

#
# submit the form!
$bot->submit();

#
# Print the Content
print $bot->content();

Step 3 : Data Processing

There are two main ways to parse markup languages like HTML, XHTML, and XML. I've always preferred the "event driven" methodology. Essentially, as the document is parsed, new tags trigger events in the code, calling functions you've defined with the attributes of the tag passed as arguments. The content between a start and end tag is handled through another callback function you've defined. This method requires that you build your own data structures. The second method parses the entire document, building a tree-like object from it which it then returns to the programmer. This second method is very useful when you have to process an entire document, modify its contents, and then transform it back into markup. Usually, a screen scraping program cares very little for the "entire document" and more for the interesting tidbits; everything else can be ignored.
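For contrast, here's roughly what the tree-based approach can look like using HTML::TreeBuilder from CPAN. This is just a sketch, aimed at the same weather.com landmark the event driven examples below use:

use strict;
use LWP::Simple;
use HTML::TreeBuilder;

#
# Build the entire document into a tree of HTML::Element objects,
# then query the tree for the landmark we care about
my $content = get 'http://www.weather.com/weather/local/21224';
die "failed to fetch the page\n" unless defined $content;

my $tree = HTML::TreeBuilder->new_from_content($content);

my $temp = $tree->look_down(_tag => 'b', class => 'obsTempTextA');
print $temp->as_text, "\n" if $temp;

#
# The tree holds the whole document in memory, so clean up when done
$tree->delete;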

HTML::Parser

HTML::Parser is an event driven HTML parser module available on CPAN. Using the content retrieval code snippet above, delete the "print $bot->content();" line and insert the following code, moving the "use" statement to the top with the others for consistency.

use HTML::Parser;

#
# store the content;
my $content = $bot->content();

#
# variables for use in our parsing sub routines:
my $grabText = undef;
my $textStr = '';

#
# Parser Engine
my $parser = new HTML::Parser(
                start_h => [ \&tagStart, "tagname, attr" ],
                end_h   => [ \&tagStop, "tagname" ],
                text_h  => [ \&handleText, "dtext" ]
);

#
# Call the parser!
$parser->parse($content);

#
# Display the text we collected between the tags
print $textStr;

#
# Handle the start tag
sub tagStart {
        my ($tagname,$attr) = @_;
        if((lc $tagname eq 'b') && $attr->{class} eq 'obsTempTextA') {
                $grabText = 1;
        }
}

#
# Handle the end tag
sub tagStop {   $grabText = undef; }

#
# check to see if we're grabbing the text;
sub handleText {
        $textStr .= shift if $grabText;
}

Using this, it's simple to extract the temperature from the variable $textStr. If you wanted to extract more information, you could use a more complex data structure to hold all the variables. The important thing to remember about the event-based model is that everything happens linearly. It's good practice to keep state, either through a simple scalar, like the $grabText variable above, or in an array or hash. If you're dealing with data that's nested in several layers of tags, you might consider something like this:

my @nestedTags = ();
my $tagWeAreLookingFor = 'td';    # whichever tag we're tracking; 'td' is just an example

sub tagStart {
    my ($tag,$attr) = @_;

    if($tag eq $tagWeAreLookingFor) {
        push @nestedTags,$tag;
    }
}

sub handleText {
    my $text = shift;

    #
    # In here, we can check where in the @nestedTag array we are, and do
    # different things based on location
    if(scalar @nestedTags == 4) {
        print "Four Tags deep, we found: $text!\n";
    }
}

sub tagStop {
    my $tag = shift;
    pop @nestedTags if $tag eq $tagWeAreLookingFor;
}

This model works great for most screen scraping, as we're usually interested in key pieces of data on a page by page basis. However, this can quickly turn your program into a mess of handler subroutines and complex tracking variables that make managing your screen scraper closer to voodoo than programming. Thankfully, HTML::Parser is fully prepared to make our lives easier by supporting subclassing.

Step 4 : SubClassing for Sanity

I usually like to have one subclassed HTML::Parser class per page. In that class, I'll include accessors to the relevant data on that page. That way, I can just "use" my class where I'm processing the data for that page and keep the main program relatively free of unnecessary clutter.

The following script uses a simple interface to pull down the current temperature. The accessor method allows the user to specify the units they'd like the temperature back in.

#!/usr/bin/perl

use strict;
use LWP::Simple;

use MyParsers::Weather::Current;

my $parser = new MyParsers::Weather::Current;

my $content = get 'http://www.weather.com/weather/local/21224';

$parser->parse($content);

print $parser->getTemperature, " degrees fahrenheit.\n";
print $parser->getTemperature('celsius'), " degrees celsius.\n";
print $parser->getTemperature('kelvin'), " degrees kelvin.\n";

The script uses a homemade module "MyParsers::Weather::Current" to handle all the parsing. The code for that module is provided below.

package MyParsers::Weather::Current;

use strict;
use HTML::Parser;

#
# Inherit
our @ISA = qw(HTML::Parser);

my %ExtraVariables = (
    _found		=> undef,
    _grabText	=> undef,
    temp_F		=> undef,
    temp_C		=> undef
);

#
# Class Functions
sub new {
    #
    # Call the Parent Constructor
    my $self = HTML::Parser::new(@_);
    #
    # Call our local initialization function
    $self->_init();
    return $self;
}

#
# Internal Init Function to Setup the Parser.
sub _init {
    my $self = shift;
    #
    # init() is provided by the parent class
    $self->init(
        start_h	=>  [ \&_handler_tagStart, 'self, tagname, attr' ],
        end_h	=>  [ \&_handler_tagStop, 'self, tagname' ],
        text_h	=>  [ \&_handler_text, 'self, dtext' ],
    );

    #
    # Set up the rest of the object
    foreach my $k (keys %ExtraVariables) {
        $self->{$k} = $ExtraVariables{$k};
    }
}

#
# Accessors
sub getTemperature {
    my ($self,$type) = @_;

    unless( $self->{_found} ) {
        print STDERR "either you forgot to call parse, or the temp data was not found!\n";
        return;
    }
    $type = 'fahrenheit' unless defined $type && length $type;

    #
    # Build the hash key from the first letter of the requested unit (temp_F or temp_C)
    my $t = 'temp_' . uc substr($type,0,1);

    return $self->{$t} if exists $self->{$t};

    print STDERR "Unknown Temperature Type ($type) !\n";
    return undef;
}

#
# Parsing Functions
sub _handler_tagStart {
    my ($self,$tag,$attr) = @_;
    if((lc $tag eq 'b') && $attr->{class} eq 'obsTempTextA') {
        $self->{_grabText} = 1;
        $self->{_found} = 1;
    }
}

sub _handler_tagStop {
    my $self = shift;
    $self->{_grabText} = undef;
}

sub _handler_text {
    my ($self,$text) = @_;
    if($self->{_grabText}) {
        if(my($temp,$forc) = ($text =~ /(\d+).*([CF])/)) {
            if($forc eq 'C') {
                $self->{temp_C} = $temp;
                #
                # Fahrenheit doesn't really make decimal places useful
                $self->{temp_F} = int((9/5) * $temp + 32);
            }
            elsif($forc eq 'F') {
                $self->{temp_F} = $temp;
                #
                # Use precision to 2 decimal places
                $self->{temp_C} = sprintf("%.2f", (5/9) * ($temp-32));
            }
        }
    }
}

Wrapping Up

HTML can be an incredibly effective transport mechanism for data, even if the original author hadn't intended it that way. With the advent of Web Services and standards-compliant designs utilizing Cascading Style Sheets, it's becoming more and more interoperable and cooperative. Learning screen scraping techniques can provide a wealth of information for the programmer to analyze and format to their heart's content.

As an exercise, you might want to expand the MyParsers::Weather::Current object to pull additional information from weather.com's page and add a few more accessors! If you'd really like a challenge, it would be fun to write a parser for each of the major weather sites, pull down their forecast data, and use a weighted average based on each site's past accuracy to get an "educated guess" at the weather conditions!
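If you take up that second challenge, the weighting itself is the easy part. A quick sketch, with completely made-up forecasts and accuracy weights:

use strict;

#
# Hypothetical forecasts (degrees F) and accuracy weights per site;
# in practice the weights would come from tracking past performance
my %forecast = ( siteA => 42, siteB => 45, siteC => 40 );
my %weight   = ( siteA => 0.50, siteB => 0.30, siteC => 0.20 );

my ($sum, $total) = (0, 0);
foreach my $site (keys %forecast) {
    $sum   += $forecast{$site} * $weight{$site};
    $total += $weight{$site};
}

printf "Educated guess: %.1f degrees fahrenheit\n", $sum / $total;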

