Smart and Simple Web Crawler
Overview
What is the Smart and Simple Web Crawler?
A smart and easy framework that crawls a web site
Integrated Lucene support
The framework is simple to integrate into your own applications
The crawler can start from one or from a list of links
Two crawling models available:
Max Iterations: Crawls a web site through a limited number of links. A fast model with a small memory footprint and low CPU usage.
Max Depth: A simple graph-model parser that does not record incoming and outgoing links. As fast as the Max Iterations model.
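The two models above differ mainly in their stopping criterion: a cap on the number of links visited versus a cap on the link distance from the start page. A minimal, self-contained sketch of the idea over an in-memory link graph (class and method names here are illustrative, not the framework's actual API; the real crawler fetches pages over HTTP):

```java
import java.util.*;

// Illustrative sketch of the two crawling models over a toy in-memory site.
public class CrawlModels {
    // Toy site: each page maps to the pages it links to.
    static Map<String, List<String>> site = Map.of(
        "/",   List.of("/a", "/b"),
        "/a",  List.of("/a1", "/a2"),
        "/b",  List.of("/b1"),
        "/a1", List.of(), "/a2", List.of(), "/b1", List.of());

    // Max Iterations: stop after visiting a fixed number of links.
    static List<String> crawlMaxIterations(String start, int maxLinks) {
        List<String> visited = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        Set<String> seen = new HashSet<>(List.of(start));
        while (!queue.isEmpty() && visited.size() < maxLinks) {
            String page = queue.poll();
            visited.add(page);
            for (String link : site.getOrDefault(page, List.of()))
                if (seen.add(link)) queue.add(link);
        }
        return visited;
    }

    // Max Depth: follow links only down to a fixed distance from the start page.
    static List<String> crawlMaxDepth(String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        Deque<Integer> depths = new ArrayDeque<>(List.of(0));
        Set<String> seen = new HashSet<>(List.of(start));
        while (!queue.isEmpty()) {
            String page = queue.poll();
            int depth = depths.poll();
            visited.add(page);
            if (depth == maxDepth) continue;  // don't follow links any deeper
            for (String link : site.getOrDefault(page, List.of()))
                if (seen.add(link)) { queue.add(link); depths.add(depth + 1); }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawlMaxIterations("/", 3)); // stops after 3 links
        System.out.println(crawlMaxDepth("/", 1));      // start page plus direct links
    }
}
```

Both variants do a breadth-first traversal; only the bound changes, which is why the Max Depth model can skip bookkeeping of incoming and outgoing links entirely.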
Accept filter interface to limit the links to be crawled
Core accept filters available: ServerFilter, BeginningPathFilter and RegularExpressionFilter
The accept filters can be combined with AND, OR and NOT
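The filter names above come from the feature list; the actual interface and signatures are not shown on this page, so the following is only a hedged sketch of how such accept filters and their AND/OR/NOT combination could look (all names here are assumptions, not the framework's real API):

```java
import java.net.URI;
import java.util.regex.Pattern;

// Illustrative accept filters and boolean combinators; names are
// assumptions modeled on the feature list, not the framework's API.
public class Filters {
    interface LinkFilter { boolean accept(URI link); }

    // Accept only links on a given host (cf. ServerFilter).
    static LinkFilter server(String host) {
        return link -> host.equals(link.getHost());
    }

    // Accept only links whose path starts with a prefix (cf. BeginningPathFilter).
    static LinkFilter beginningPath(String prefix) {
        return link -> link.getPath() != null && link.getPath().startsWith(prefix);
    }

    // Accept only links matching a regular expression (cf. RegularExpressionFilter).
    static LinkFilter regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        return link -> p.matcher(link.toString()).matches();
    }

    // Boolean combinators for composing filters.
    static LinkFilter and(LinkFilter a, LinkFilter b) { return l -> a.accept(l) && b.accept(l); }
    static LinkFilter or(LinkFilter a, LinkFilter b)  { return l -> a.accept(l) || b.accept(l); }
    static LinkFilter not(LinkFilter f)               { return l -> !f.accept(l); }

    public static void main(String[] args) {
        // Stay on example.org, under /docs, but skip PDF files.
        LinkFilter f = and(server("example.org"),
                       and(beginningPath("/docs"), not(regex(".*\\.pdf$"))));
        System.out.println(f.accept(URI.create("http://example.org/docs/intro.html"))); // true
        System.out.println(f.accept(URI.create("http://example.org/docs/spec.pdf")));   // false
    }
}
```

Composing small single-purpose filters this way keeps the crawl boundary declarative: the crawler asks one combined filter whether to follow each discovered link.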
Pluggable HTTP connection libraries: HttpClient (default) and HTMLParser (optional)
Custom listeners can be added to the parsing process
The framework is not a GUI based tool to mirror a website and browse the site offline!
License: Apache License Version 2.0, January 2004
Requirements
Java 1.4
crawler.jar
commons-httpclient-3.0.1.jar, commons-codec-1.3.jar and commons-logging-1.1.jar
Release Notes
Version 1.0.0 of the open-source project Crawler was released on 17th December 2006. This release supports authentication schemes in the DownloadHelper and a connection manager in the SimpleHtmlParser. More details can be found in the change log.
Tutorial
Installation and Configuration
Add crawler.jar and the dependencies commons-httpclient-3.0.1.jar, commons-codec-1.3.jar and commons-logging-1.1.jar to your classpath.
No other configuration is needed. See the examples.
For Lucene 2.0.0 support add lucene-core-2.0.0.jar and lucene-demos-2.0.0.jar (HTMLDocument) to the classpath.
If the SimpleHtmlParser is used, HTMLParser's htmlparser.jar must also be added to the classpath.
If you use the MultiThreadedCrawler, you will have to add the backport-util-concurrent-2.2.jar to the classpath.
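Putting the steps above together, a classpath for compiling and running against the framework could be assembled like this (the lib/ layout and the MyCrawler class are hypothetical; adjust paths to wherever you unpacked the distribution):

```shell
# Hypothetical layout: all jars in ./lib, application sources in ./src.
CP="lib/crawler.jar:lib/commons-httpclient-3.0.1.jar:lib/commons-codec-1.3.jar:lib/commons-logging-1.1.jar"

# Optional extras, depending on the features used:
# CP="$CP:lib/lucene-core-2.0.0.jar:lib/lucene-demos-2.0.0.jar"   # Lucene 2.0.0 support
# CP="$CP:lib/htmlparser.jar"                                     # SimpleHtmlParser
# CP="$CP:lib/backport-util-concurrent-2.2.jar"                   # MultiThreadedCrawler

javac -cp "$CP" src/MyCrawler.java
java  -cp "$CP:src" MyCrawler
```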
JavaDoc API
Examples
Download
Help wanted!
If you want to write a GUI add-on to configure and run the framework, or to implement some features of the roadmap, please let me know.
The GUI should be used to analyze the site structure (creating a sitemap, showing all outgoing and internal links, etc.).