Smart and Simple Web Crawler
Overview
What is the Smart and Simple Web Crawler?
A smart and easy framework that crawls a web site
Integrated Lucene support
The framework is simple to integrate into your own applications
The crawler can start from one or from a list of links
Two crawling models available:
Max Iterations: Crawls a web site through a limited number of links. A fast model with a small memory footprint and low CPU usage.
Max Depth: A simple graph-model parser that does not record incoming and outgoing links. As fast as the Max Iterations model.
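The two models above differ mainly in their stopping criterion: a cap on the number of links visited versus a cap on the link distance from the start page. A minimal, self-contained sketch of the idea over an in-memory link graph (class and method names here are illustrative, not the framework's actual API; the real crawler fetches pages over HTTP):

```java
import java.util.*;

// Illustrative sketch of the two crawling models over a toy in-memory site.
public class CrawlModels {
    // Toy site: each page maps to the pages it links to.
    static Map<String, List<String>> site = Map.of(
        "/",   List.of("/a", "/b"),
        "/a",  List.of("/a1", "/a2"),
        "/b",  List.of("/b1"),
        "/a1", List.of(), "/a2", List.of(), "/b1", List.of());

    // Max Iterations: stop after visiting a fixed number of links.
    static List<String> crawlMaxIterations(String start, int maxLinks) {
        List<String> visited = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        Set<String> seen = new HashSet<>(List.of(start));
        while (!queue.isEmpty() && visited.size() < maxLinks) {
            String page = queue.poll();
            visited.add(page);
            for (String link : site.getOrDefault(page, List.of()))
                if (seen.add(link)) queue.add(link);
        }
        return visited;
    }

    // Max Depth: follow links only down to a fixed distance from the start page.
    static List<String> crawlMaxDepth(String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        Deque<Integer> depths = new ArrayDeque<>(List.of(0));
        Set<String> seen = new HashSet<>(List.of(start));
        while (!queue.isEmpty()) {
            String page = queue.poll();
            int depth = depths.poll();
            visited.add(page);
            if (depth == maxDepth) continue;  // don't follow links any deeper
            for (String link : site.getOrDefault(page, List.of()))
                if (seen.add(link)) { queue.add(link); depths.add(depth + 1); }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawlMaxIterations("/", 3)); // stops after 3 links
        System.out.println(crawlMaxDepth("/", 1));      // start page plus direct links
    }
}
```

Both variants do a breadth-first traversal; only the bound changes, which is why the Max Depth model can skip bookkeeping of incoming and outgoing links entirely.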
Accept filter interface to limit the links to be crawled
Core accept filters available: ServerFilter, BeginningPathFilter and RegularExpressionFilter
The accept filters can be combined with AND, OR and NOT
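The filter names above come from the feature list; the actual interface and signatures are not shown on this page, so the following is only a hedged sketch of how such accept filters and their AND/OR/NOT combination could look (all names here are assumptions, not the framework's real API):

```java
import java.net.URI;
import java.util.regex.Pattern;

// Illustrative accept filters and boolean combinators; names are
// assumptions modeled on the feature list, not the framework's API.
public class Filters {
    interface LinkFilter { boolean accept(URI link); }

    // Accept only links on a given host (cf. ServerFilter).
    static LinkFilter server(String host) {
        return link -> host.equals(link.getHost());
    }

    // Accept only links whose path starts with a prefix (cf. BeginningPathFilter).
    static LinkFilter beginningPath(String prefix) {
        return link -> link.getPath() != null && link.getPath().startsWith(prefix);
    }

    // Accept only links matching a regular expression (cf. RegularExpressionFilter).
    static LinkFilter regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        return link -> p.matcher(link.toString()).matches();
    }

    // Boolean combinators for composing filters.
    static LinkFilter and(LinkFilter a, LinkFilter b) { return l -> a.accept(l) && b.accept(l); }
    static LinkFilter or(LinkFilter a, LinkFilter b)  { return l -> a.accept(l) || b.accept(l); }
    static LinkFilter not(LinkFilter f)               { return l -> !f.accept(l); }

    public static void main(String[] args) {
        // Stay on example.org, under /docs, but skip PDF files.
        LinkFilter f = and(server("example.org"),
                       and(beginningPath("/docs"), not(regex(".*\\.pdf$"))));
        System.out.println(f.accept(URI.create("http://example.org/docs/intro.html"))); // true
        System.out.println(f.accept(URI.create("http://example.org/docs/spec.pdf")));   // false
    }
}
```

Composing small single-purpose filters this way keeps the crawl boundary declarative: the crawler asks one combined filter whether to follow each discovered link.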
Pluggable HTTP connection libraries: HttpClient (default) and HTMLParser (optional)
Custom listeners can be added to the parsing process
The framework is not a GUI based tool to mirror a website and browse the site offline!
License: Apache License Version 2.0, January 2004
Requirements
Java 1.4
crawler.jar
commons-httpclient-3.0.1.jar, commons-codec-1.3.jar and commons-logging-1.1.jar
Release Notes
Version 1.0.0 of the open-source project Crawler was released on 17th December 2006. This release supports authentication schemes in the DownloadHelper and a connection manager in the SimpleHtmlParser. More details can be found in the change log.
Tutorial
Installation and Configuration
Add crawler.jar and the dependencies commons-httpclient-3.0.1.jar, commons-codec-1.3.jar and commons-logging-1.1.jar to your classpath.
No other configuration is needed. See the examples.
For Lucene 2.0.0 support add lucene-core-2.0.0.jar and lucene-demos-2.0.0.jar (HTMLDocument) to the classpath.
If the SimpleHtmlParser is used, HTMLParser's htmlparser.jar must also be added to the classpath.
If you use the MultiThreadedCrawler, you will have to add the backport-util-concurrent-2.2.jar to the classpath.
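Putting the steps above together, a classpath for compiling and running against the framework could be assembled like this (the lib/ layout and the MyCrawler class are hypothetical; adjust paths to wherever you unpacked the distribution):

```shell
# Hypothetical layout: all jars in ./lib, application sources in ./src.
CP="lib/crawler.jar:lib/commons-httpclient-3.0.1.jar:lib/commons-codec-1.3.jar:lib/commons-logging-1.1.jar"

# Optional extras, depending on the features used:
# CP="$CP:lib/lucene-core-2.0.0.jar:lib/lucene-demos-2.0.0.jar"   # Lucene 2.0.0 support
# CP="$CP:lib/htmlparser.jar"                                     # SimpleHtmlParser
# CP="$CP:lib/backport-util-concurrent-2.2.jar"                   # MultiThreadedCrawler

javac -cp "$CP" src/MyCrawler.java
java  -cp "$CP:src" MyCrawler
```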
JavaDoc API
Examples
Download
Help wanted!
If you want to write a GUI add-on to configure and run the framework, or to implement some features of the roadmap, please let me know.
The GUI should be used to analyze the site structure (creating a sitemap, showing all outgoing and internal links, etc.).