FEAR-less Site Scraping
by Yung-chung Lin
June 01, 2006
Imagine that you have an assignment in which you need to fetch all of the web pages of a given website, scrape data from them, and transfer the data to another place, such as a database or plain files. This is a common scenario for data scraping tasks, and CPAN has plenty of modules for this job.
While I was developing site-scraping scripts, retrieving data from some sites of the same type, I realized that I had repeated many identical or very similar code structures, such as:
fetch_the_homepage();
while (there_are_some_more_unfetched_links) {
    foreach $link (@{links_in_the_current_page}) {
        follow_link()          if $link =~ /NEXT_PAGE_OR_SOMETHING/;
        extract_product_spec() if $link =~ /PRODUCT_SPEC_PAGE/;
    }
}
The Usual Tools
At the very beginning, I created scripts using LWP::Simple, LWP::UserAgent, and vanilla regular expressions to extract links and details. As the number of scripts grew, I needed more powerful resources, so I started to use WWW::Mechanize for web page fetching and Regexp::Bind, Template::Extract, HTML::LinkExtractor, Regexp::Common, etc. for data scraping. However, I still found many redundancies.
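To give a feel for that first generation of scripts, here is a minimal sketch of the LWP::Simple-plus-regex style: fetch a page, pull out links with a hand-rolled pattern, and grab one field. The URL and patterns are placeholders, not taken from the original scripts.

use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch the raw HTML (placeholder URL)
my $html = get('http://example.com/products.html')
    or die "Could not fetch the page\n";

# Naive link extraction with a vanilla regex
my @links = $html =~ /<a\s+[^>]*href="([^"]+)"/gi;

# Naive field extraction, one regex per field (placeholder pattern)
my ($price) = $html =~ /Price:\s*\$([\d.]+)/;

print "$_\n" for @links;
print "price: $price\n" if defined $price;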
A scraping script first needs to use the essential modules for the site scraping task. Second, it may need to instantiate objects. Third, site scraping involves many interactions among different modules, mostly by passing data between them. After you fetch a page, you may need to pass it to HTML::LinkExtractor to extract links, to Template::Extract to get detailed information, or save it to a file. You may then store the extracted data in a relational database. Considering these properties, creating a site scraping script is very time-consuming, and it often produces a lot of duplicated code.
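To make that plumbing concrete, here is a rough sketch of the kind of hand-off between modules described above, assuming a placeholder URL, a placeholder template, and an SQLite table chosen only for illustration.

use strict;
use warnings;
use WWW::Mechanize;
use HTML::LinkExtractor;
use Template::Extract;
use DBI;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/catalog.html');   # placeholder URL
my $html = $mech->content;

# Hand the page to HTML::LinkExtractor to pull out links
my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);
my @hrefs = grep { defined } map { $_->{href} } @{ $lx->links };

# Hand the same page to Template::Extract to pull out fields
my $ext    = Template::Extract->new;
my $record = $ext->extract('<h1>[% title %]</h1>', $html);

# Finally, push the extracted data into a relational database
my $dbh = DBI->connect('dbi:SQLite:dbname=scrape.db', '', '');
$dbh->do('CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)');
$dbh->do('INSERT INTO pages (url, title) VALUES (?, ?)',
         undef, $mech->uri->as_string, $record ? $record->{title} : undef);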
Thus, I tried to fuse some modules together, hoping to save some of my keystrokes and simplify the coding process.
An Example using WWW::Mechanize and Template::Extract
Here's a typical site scraping script structure:
use strict;
use warnings;
use YAML;
use Data::Dumper;
use WWW::Mechanize;
use Template::Extract;

# Placeholder: a Template::Extract template describing the data to capture
my $template = '<a href="[% url %]">[% title %]</a>';

my $mech = WWW::Mechanize->new();
$mech->get( "https://search.cpan.org" );

my $ext    = Template::Extract->new;
my @result = $ext->extract($template, $mech->content);
print Dumper \@result;

my @link;
foreach ($mech->links) {               # $_->[0] holds the link URL
    if ( $_->[0] =~ /foo/ ) {          # follow links matching /foo/
        $mech->get($_->[0]);
    }
    elsif ( $_->[0] =~ /bar/ ) {       # collect links matching /bar/
        push @link, $_->[0];
    }
    else {                             # handle everything else
        sub { 'do something here' }->($_->[0]);
    }
}

print $mech->content;
print Dumper \@link;

foreach (@result) {
    print YAML::Dump($_);
}
This program does several things:
- Fetch CPAN's homepage.
- Extract data with a template.
- Process links using a control structure.
- Print fetched content to STDOUT.
- Dump links in the page.
- Use YAML to print the extracted results.
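The $template the script relies on is an ordinary Template::Extract template. As an illustration only (the page fragment below is invented, not taken from search.cpan.org), a template for a repeated list of links and the structure it yields might look roughly like this:

use strict;
use warnings;
use YAML;
use Template::Extract;

# Invented template and page fragment, purely for illustration
my $template = << '.';
<ul>[% FOREACH record %]
<li><a href="[% url %]">[% title %]</a></li>[% END %]
</ul>
.

my $html = << '.';
<ul>
<li><a href="/dist/Foo">Foo</a></li>
<li><a href="/dist/Bar">Bar</a></li>
</ul>
.

my $result = Template::Extract->new->extract($template, $html);
print YAML::Dump($result);
# Yields a structure along the lines of:
#   record:
#     - { url: /dist/Foo, title: Foo }
#     - { url: /dist/Bar, title: Bar }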
If you need to create just one or two temporary scripts, copying and pasting is acceptable. Things become messy, though, when the job is to create a hundred scripts and you are still copying and pasting.