FEAR-less Site Scraping
by Yung-chung Lin
June 01, 2006
Imagine that you have an assignment in which you need to fetch all of the web pages of a given website, scrape data from them, and transfer the data to another place, such as a database or plain files. This is a common scenario for data scraping tasks, and CPAN has plenty of modules for this job.
While I was developing site-scraping scripts, retrieving data from some sites of the same type, I realized that I had repeated many identical or very similar code structures, such as:
fetch_the_homepage();
while (there_are_some_more_unfetched_links) {
    foreach $link (@{links_in_the_current_page}) {
        follow_link()          if $link =~ /NEXT_PAGE_OR_SOMETHING/;
        extract_product_spec() if $link =~ /PRODUCT_SPEC_PAGE/;
    }
}
The Usual Tools
At the very beginning, I created scripts using LWP::Simple, LWP::UserAgent, and vanilla regular expressions to extract links and details. As the number of scripts grew, I needed more powerful resources, so I started to use WWW::Mechanize for web page fetching and Regexp::Bind, Template::Extract, HTML::LinkExtractor, Regexp::Common, etc. for data scraping. However, I still found many redundancies.
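To give a feel for that first generation of scripts, here is a minimal sketch of the LWP::Simple-plus-regex style: fetch a page, pull out links with a hand-rolled pattern, and grab one field. The URL and patterns are placeholders, not taken from the original scripts.

use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch the raw HTML (placeholder URL)
my $html = get('http://example.com/products.html')
    or die "Could not fetch the page\n";

# Naive link extraction with a vanilla regex
my @links = $html =~ /<a\s+[^>]*href="([^"]+)"/gi;

# Naive field extraction, one regex per field (placeholder pattern)
my ($price) = $html =~ /Price:\s*\$([\d.]+)/;

print "$_\n" for @links;
print "price: $price\n" if defined $price;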
A scraping script first needs to use the essential modules for the site scraping task. Second, it may need to instantiate objects. Third, site scraping involves many interactions among different modules, mostly by passing data between them. After you fetch a page, you may need to pass it to HTML::LinkExtractor to extract links, to Template::Extract to get detailed information, or save it to a file. You may then store the extracted data in a relational database. Considering these properties, creating a site scraping script is very time-consuming, and it often produces a lot of duplicated code.
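To make that plumbing concrete, here is a rough sketch of the kind of hand-off between modules described above, assuming a placeholder URL, a placeholder template, and an SQLite table chosen only for illustration.

use strict;
use warnings;
use WWW::Mechanize;
use HTML::LinkExtractor;
use Template::Extract;
use DBI;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/catalog.html');   # placeholder URL
my $html = $mech->content;

# Hand the page to HTML::LinkExtractor to pull out links
my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);
my @hrefs = grep { defined } map { $_->{href} } @{ $lx->links };

# Hand the same page to Template::Extract to pull out fields
my $ext    = Template::Extract->new;
my $record = $ext->extract('<h1>[% title %]</h1>', $html);

# Finally, push the extracted data into a relational database
my $dbh = DBI->connect('dbi:SQLite:dbname=scrape.db', '', '');
$dbh->do('CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)');
$dbh->do('INSERT INTO pages (url, title) VALUES (?, ?)',
         undef, $mech->uri->as_string, $record ? $record->{title} : undef);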
Thus, I tried to fuse some modules together, hoping to save some of my keystrokes and simplify the coding process.
An Example using WWW::Mechanize and Template::Extract
Here's a typical site scraping script structure:
use strict;
use warnings;
use YAML;
use Data::Dumper;
use WWW::Mechanize;
use Template::Extract;

# Placeholder: a Template::Extract template describing the data to capture
my $template = '<a href="[% url %]">[% title %]</a>';

my $mech = WWW::Mechanize->new();
$mech->get( "https://search.cpan.org" );

my $ext    = Template::Extract->new;
my @result = $ext->extract($template, $mech->content);
print Dumper \@result;

my @link;
foreach ($mech->links) {               # $_->[0] holds the link URL
    if ( $_->[0] =~ /foo/ ) {          # follow links matching /foo/
        $mech->get($_->[0]);
    }
    elsif ( $_->[0] =~ /bar/ ) {       # collect links matching /bar/
        push @link, $_->[0];
    }
    else {                             # handle everything else
        sub { 'do something here' }->($_->[0]);
    }
}

print $mech->content;
print Dumper \@link;

foreach (@result) {
    print YAML::Dump($_);
}
This program does several things:
- Fetch CPAN's homepage.
- Extract data with a template.
- Process links using a control structure.
- Print fetched content to STDOUT.
- Dump links in the page.
- Use YAML to print the extracted results.
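The $template the script relies on is an ordinary Template::Extract template. As an illustration only (the page fragment below is invented, not taken from search.cpan.org), a template for a repeated list of links and the structure it yields might look roughly like this:

use strict;
use warnings;
use YAML;
use Template::Extract;

# Invented template and page fragment, purely for illustration
my $template = << '.';
<ul>[% FOREACH record %]
<li><a href="[% url %]">[% title %]</a></li>[% END %]
</ul>
.

my $html = << '.';
<ul>
<li><a href="/dist/Foo">Foo</a></li>
<li><a href="/dist/Bar">Bar</a></li>
</ul>
.

my $result = Template::Extract->new->extract($template, $html);
print YAML::Dump($result);
# Yields a structure along the lines of:
#   record:
#     - { url: /dist/Foo, title: Foo }
#     - { url: /dist/Bar, title: Bar }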
If you need to create just one or two temporary scripts, copying and pasting is acceptable. Things become messy, though, when the job is to create a hundred scripts and you are still copying and pasting.