danopia's spider at master - GitHub
Spider is a Web spidering library for Ruby. It handles robots.txt, scraping, collecting, and looping so that you can just handle the data. Copy of https://rubyforge.org/projects/spider/; couldn't find a git repo to fork.
README
Spider, a Web spidering library for Ruby. It handles the robots.txt,
scraping, collecting, and looping so that you can just handle the data.

== Examples

=== Crawl the Web, loading each page in turn, until you run out of memory

  require 'spider'
  Spider.start_at('https://mike-burns.com/') {}

=== To handle erroneous responses

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.on :failure do |a_url, resp, prior_url|
      puts "URL failed: #{a_url}"
      puts " linked from #{prior_url}"
    end
  end

=== Or handle successful responses

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.on :success do |a_url, resp, prior_url|
      puts "#{a_url}: #{resp.code}"
      puts resp.body
      puts
    end
  end

=== Limit to just one domain

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.add_url_check do |a_url|
      a_url =~ %r{^https://mike-burns.com.*}
    end
  end

=== Pass headers to some requests

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.setup do |a_url|
      if a_url =~ %r{^https://.*wikipedia.*}
        headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
      end
    end
  end

=== Use memcached to track cycles

  require 'spider'
  require 'spider/included_in_memcached'
  SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
  Spider.start_at('https://mike-burns.com/') do |s|
    s.check_already_seen_with IncludedInMemcached.new(SERVERS)
  end

=== Track cycles with a custom object

  require 'spider'
  class ExpireLinks < Hash
    def <<(v)
      self[v] = Time.now
    end
    def include?(v)
      self[v].kind_of?(Time) && (self[v] + 86400) >= Time.now
    end
  end

  Spider.start_at('https://mike-burns.com/') do |s|
    s.check_already_seen_with ExpireLinks.new
  end

=== Store nodes to visit with Amazon SQS

  require 'spider'
  require 'spider/next_urls_in_sqs'
  Spider.start_at('https://mike-burns.com') do |s|
    s.store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY)
  end

==== Store nodes to visit with a custom object

  require 'spider'
  class MyArray < Array
    def pop
      super
    end

    def push(a_msg)
      super(a_msg)
    end
  end

  Spider.start_at('https://mike-burns.com') do |s|
    s.store_next_urls_with MyArray.new
  end

=== Create a URL graph

  require 'spider'
  nodes = {}
  Spider.start_at('https://mike-burns.com/') do |s|
    s.add_url_check {|a_url| a_url =~ %r{^https://mike-burns.com.*} }

    s.on(:every) do |a_url, resp, prior_url|
      nodes[prior_url] ||= []
      nodes[prior_url] << a_url
    end
  end

=== Use a proxy

  require 'net/http_configuration'
  require 'spider'
  http_conf = Net::HTTP::Configuration.new(:proxy_host => '7proxies.org',
                                           :proxy_port => 8881)
  http_conf.apply do
    Spider.start_at('https://img.4chan.org/b/') do |s|
      s.on(:success) do |a_url, resp, prior_url|
        File.open(a_url.gsub('/',':'),'w') do |f|
          f.write(resp.body)
        end
      end
    end
  end

== Author

John Nagro
john.nagro@gmail.com

Mike Burns https://mike-burns.com mike@mike-burns.com (original author)

Many thanks to:
Matt Horan
Henri Cook
Sander van der Vliet
John Buckley
Brian Campbell

With `robot_rules' from James Edward Gray II via
https://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589
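As a quick illustration of how the pieces documented above compose, here is a minimal sketch (not part of the original README, and untested) that combines a domain restriction, the ExpireLinks cycle tracker from the README, and failure/success callbacks. The crawl target example.com is a placeholder; only API calls shown in the README are used.

  require 'spider'

  # Cycle tracker from the README above: remembers a visited URL for 24 hours.
  class ExpireLinks < Hash
    def <<(v)
      self[v] = Time.now
    end

    def include?(v)
      self[v].kind_of?(Time) && (self[v] + 86400) >= Time.now
    end
  end

  # Placeholder start URL; swap in the site you actually want to crawl.
  Spider.start_at('https://example.com/') do |s|
    # Stay on one domain.
    s.add_url_check { |a_url| a_url =~ %r{^https://example\.com.*} }

    # Avoid revisiting recently seen URLs.
    s.check_already_seen_with ExpireLinks.new

    # Report broken links with the page that linked to them.
    s.on(:failure) do |a_url, resp, prior_url|
      warn "failed: #{a_url} (linked from #{prior_url})"
    end

    # Do something with each page that loads successfully.
    s.on(:success) do |a_url, resp, prior_url|
      puts "#{a_url}: #{resp.code}"
    end
  end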