danopia's spider at master - GitHub
Spider is a Web spidering library for Ruby. It handles robots.txt, scraping, collecting, and looping so that you can just handle the data. Copy of https://rubyforge.org/projects/spider/; couldn't find a git repo to fork.
README
Spider, a Web spidering library for Ruby. It handles the robots.txt,
scraping, collecting, and looping so that you can just handle the data.

== Examples

=== Crawl the Web, loading each page in turn, until you run out of memory

  require 'spider'
  Spider.start_at('https://mike-burns.com/') {}

=== To handle erroneous responses

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.on :failure do |a_url, resp, prior_url|
      puts "URL failed: #{a_url}"
      puts " linked from #{prior_url}"
    end
  end

=== Or handle successful responses

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.on :success do |a_url, resp, prior_url|
      puts "#{a_url}: #{resp.code}"
      puts resp.body
      puts
    end
  end

=== Limit to just one domain

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.add_url_check do |a_url|
      a_url =~ %r{^https://mike-burns.com.*}
    end
  end

=== Pass headers to some requests

  require 'spider'
  Spider.start_at('https://mike-burns.com/') do |s|
    s.setup do |a_url|
      if a_url =~ %r{^https://.*wikipedia.*}
        headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
      end
    end
  end

=== Use memcached to track cycles

  require 'spider'
  require 'spider/included_in_memcached'
  SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
  Spider.start_at('https://mike-burns.com/') do |s|
    s.check_already_seen_with IncludedInMemcached.new(SERVERS)
  end

=== Track cycles with a custom object

  require 'spider'
  class ExpireLinks < Hash
    def <<(v)
      self[v] = Time.now
    end
    def include?(v)
      self[v].kind_of?(Time) && (self[v] + 86400) >= Time.now
    end
  end

  Spider.start_at('https://mike-burns.com/') do |s|
    s.check_already_seen_with ExpireLinks.new
  end

=== Store nodes to visit with Amazon SQS

  require 'spider'
  require 'spider/next_urls_in_sqs'
  Spider.start_at('https://mike-burns.com') do |s|
    s.store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY)
  end

==== Store nodes to visit with a custom object

  require 'spider'
  class MyArray < Array
    def pop
      super
    end

    def push(a_msg)
      super(a_msg)
    end
  end

  Spider.start_at('https://mike-burns.com') do |s|
    s.store_next_urls_with MyArray.new
  end

=== Create a URL graph

  require 'spider'
  nodes = {}
  Spider.start_at('https://mike-burns.com/') do |s|
    s.add_url_check {|a_url| a_url =~ %r{^https://mike-burns.com.*} }

    s.on(:every) do |a_url, resp, prior_url|
      nodes[prior_url] ||= []
      nodes[prior_url] << a_url
    end
  end

=== Use a proxy

  require 'net/http_configuration'
  require 'spider'
  http_conf = Net::HTTP::Configuration.new(:proxy_host => '7proxies.org',
                                           :proxy_port => 8881)
  http_conf.apply do
    Spider.start_at('https://img.4chan.org/b/') do |s|
      s.on(:success) do |a_url, resp, prior_url|
        File.open(a_url.gsub('/',':'),'w') do |f|
          f.write(resp.body)
        end
      end
    end
  end

== Author

John Nagro
john.nagro@gmail.com

Mike Burns https://mike-burns.com mike@mike-burns.com (original author)

Many thanks to:
Matt Horan
Henri Cook
Sander van der Vliet
John Buckley
Brian Campbell

With `robot_rules' from James Edward Gray II via
https://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589
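As a quick illustration of how the pieces documented above compose, here is a minimal sketch (not part of the original README, and untested) that combines a domain restriction, the ExpireLinks cycle tracker from the README, and failure/success callbacks. The crawl target example.com is a placeholder; only API calls shown in the README are used.

  require 'spider'

  # Cycle tracker from the README above: remembers a visited URL for 24 hours.
  class ExpireLinks < Hash
    def <<(v)
      self[v] = Time.now
    end

    def include?(v)
      self[v].kind_of?(Time) && (self[v] + 86400) >= Time.now
    end
  end

  # Placeholder start URL; swap in the site you actually want to crawl.
  Spider.start_at('https://example.com/') do |s|
    # Stay on one domain.
    s.add_url_check { |a_url| a_url =~ %r{^https://example\.com.*} }

    # Avoid revisiting recently seen URLs.
    s.check_already_seen_with ExpireLinks.new

    # Report broken links with the page that linked to them.
    s.on(:failure) do |a_url, resp, prior_url|
      warn "failed: #{a_url} (linked from #{prior_url})"
    end

    # Do something with each page that loads successfully.
    s.on(:success) do |a_url, resp, prior_url|
      puts "#{a_url}: #{resp.code}"
    end
  end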