why's hpricot at master - GitHub
Fork of manveru/hpricot
Description: A swift, liberal HTML parser with a fantastic library
Homepage: https://code.whytheluckystiff.net/hpricot/
Clone URL: git://github.com/why/hpricot.git

git clone git://github.com/why/hpricot.git

hpricot /

name | age | message
---|---|---
CHANGELOG | Fri Jun 15 15:32:25 -0700 2007 | * Rakefile: prepping for 0.6 release. [why]
COPYING | Mon Jul 03 18:17:08 -0700 2006 | * ext/hpricot_scan: yeay, i got an html scanne... [why]
README | Mon Jun 04 22:58:55 -0700 2007 | * lib/hpricot/elements.rb: added block syntax ... [why]
Rakefile | Wed Dec 10 13:21:52 -0800 2008 | * Rakefile: run ragel task when building exten... [why]
ext/ | Sun Dec 07 04:33:42 -0800 2008 | Merge branch 'master' of git://github.com/why/h... [coderrr]
extras/ | Thu Aug 10 20:13:14 -0700 2006 | * lib/hpricot/elements.rb: use `contents` to g... [why]
lib/ | Tue Nov 25 20:07:17 -0800 2008 | * ext/hpricot_scan/hpricot_css.rl: hand the cs... [why]
setup.rb | Thu Nov 23 08:38:43 -0800 2006 | * setup.rb: for installing in site_ruby. [why]
test/ | Sun Dec 07 04:33:42 -0800 2008 | Merge branch 'master' of git://github.com/why/h... [coderrr]

= Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like some JavaScript libs -- JQuery, Prototype -- give you.) The XPath and CSS
parser, in fact, is based on John Resig's JQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used. If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out. You know, that sort of
thing.

*Please read this entire document* before making assumptions about how this
software works.

== An Overview

Let's clear up what Hpricot is.

# Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
  pays a small penalty in order to get that right. So that's slightly more
  important to me than speed.
# *If you can see it in Firefox, then Hpricot should parse it.* That's how it
  should be! Let me know the minute it's otherwise.
# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
  HTML by having some idea of what good HTML is. Some people still like to use
  Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for
  that!

== The Hpricot Kingdom

First, here are all the links you need to know:

* https://code.whytheluckystiff.net/hpricot is the Hpricot wiki and bug
  tracker. Go there for news and recipes and patches. It's the center of
  activity.
* https://code.whytheluckystiff.net/svn/hpricot/trunk is the main Subversion
  repository for Hpricot. You can get the latest code there.
* https://code.whytheluckystiff.net/doc/hpricot is the home for the latest copy
  of this reference.
* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)

If you have any trouble, don't hesitate to contact the author.
As always, I'm not going to say "Use at your own risk" because I don't want
this library to be risky. If you trip on something, I'll share the liability by
repairing things as quickly as I can. Your responsibility is to report the
inadequacies.

== Installing Hpricot

You may get the latest stable version from Rubyforge. Win32 binaries and source
gems are available.

  $ gem install hpricot

As Hpricot is still under active development, you can also try the most recent
candidate build here:

  $ gem install hpricot --source https://code.whytheluckystiff.net

The development gem is usually in pretty good shape actually.

You can also get the bleeding edge code or plain Ruby tarballs on the wiki.

== An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
https://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics, in case you want
to add some of your own.

=== Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

  require 'rubygems'
  require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

  require 'hpricot'

=== Load an HTML Page

The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.

  doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

  doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:

  require 'open-uri'
  doc = open("https://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.

=== Search for Elements

Use <tt>Doc.search</tt>:

  doc.search("//p[@class='posted']")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>

<tt>Doc.search</tt> can take an XPath or CSS expression.
In the above example, all paragraph <tt><p></tt> elements are grabbed which
have a <tt>class</tt> attribute of <tt>"posted"</tt>.

A shortcut is to use the divisor:

  (doc/"p.posted")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>

=== Finding Just One Element

If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression. In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.

  doc.at("body")['onload']

The above code will find the body tag and give you back the <tt>onload</tt>
attribute. This is the most common reason to use the element directly: when
reading and writing HTML attributes.

=== Fetching the Contents of an Element

Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.

  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."

If your expression matches more than one element, you'll get back the contents
of <em>all the matched elements</em>. So you may want to use <tt>first</tt> to
be sure you get back only one.

  (doc/"#elementID").first.inner_html
  #=> "..<b>contents</b>.."

=== Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:

  (doc/"#elementID").to_html
  #=> "<div id='elementID'>...</div>"

=== Looping

All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
through them like you would an array.

  (doc/"p/a/img").each do |img|
    puts img.attributes['class']
  end

=== Continuing Searches

Searches can be continued from a collection of elements, in order to search
deeper.

  # find all paragraphs.
  elements = doc.search("/html/body//p")
  # continue the search by finding any images within those paragraphs.
  (elements/"img")
  #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

  # find all images within paragraphs.
  doc.search("/html/body//p").each do |para|
    puts "== Found a paragraph =="
    pp para

    imgs = para.search("img")
    if imgs.any?
      puts "== Found #{imgs.length} images inside =="
    end
  end

Of course, the most succinct ways to do the above are using CSS or XPath.

  # the xpath version
  (doc/"/html/body//p//img")
  # the css version
  (doc/"html > body > p img")
  # ..or symbols work, too!
  (doc/:html/:body/:p/:img)

=== Looping Edits

You may certainly edit objects from within your search loops. Then, when you
spit out the HTML, the altered elements will show.

  (doc/"span.entryPermalink").each do |span|
    span.attributes['class'] = 'newLinks'
  end
  puts doc

This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
of doing this. Such as the <tt>set</tt> method:

  (doc/"span.entryPermalink").set(:class => 'newLinks')

=== Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.

The <tt>css_path</tt> method:

  doc.at("div > div:nth(1)").css_path
  #=> "div > div:nth(1)"
  doc.at("#header").css_path
  #=> "#header"

Or, the <tt>xpath</tt> method:

  doc.at("div > div:nth(1)").xpath
  #=> "/div/div:eq(1)"
  doc.at("#header").xpath
  #=> "//div[@id='header']"

== Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.

== :fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible.
Making sure to open and close all the tags, but ignore any validation problems.

As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.

  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow
the rules a bit better.
Like: say Hpricot finds a paragraph in a link, it's going to move the paragraph
below the link. Or up and out of other elements where paragraphs don't belong.

If an unknown element is found, it is ignored. Again, <tt>:fixup_tags</tt>.

== :xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document. Even at the cost of removing elements that get in the way.

  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does <tt>:xhtml_strict</tt> take?

1. Shift elements into their proper containers just like :fixup_tags.
2. Remove unknown elements.
3. Remove unknown attributes.
4. Remove illegal content.
5. Alter the doctype to XHTML 1.0 Strict.

== Hpricot.XML()

The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode. The main difference is that :xml mode won't try to output
tags which are friendlier for browsers. For example, if an opening and closing
<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.

XML mode also doesn't downcase the tags and attributes for you. So pay
attention to case, friends.

The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:

  doc = open("https://redhanded.hobix.com/index.xml") do |f|
    Hpricot.XML(f)
  end

*Also, :fixup_tags is canceled out by the :xml option.* This is because
:fixup_tags makes assumptions based on how HTML is structured. Specifically,
how tags are defined in the XHTML 1.0 DTD.