Scraping HTML · timrdf/csv2rdf4lod-automation Wiki · GitHub
Scraping HTML
Tim L edited this page Jul 17, 2015 · 46 revisions
::sigh::
First, a nice article about just using the web as an API.
Others' work:
- https://scraperwiki.com
- Python-based parser: BeautifulSoup
- https://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl parses an HTML string into a DOM object.
  - Its functions are in xmlns:d="data:,dpc".
- See also XSL Crib Sheet, xargs Cheat Sheet
- Scraping can benefit from SDV organization.
This page lists some XSL utility functions that we've developed to scrape HTML:
- html:text - get all of the displayable text within a DOM element's hierarchy.
- html:anchor-labels - get all of the anchors' displayable text, delimited by ||.
- html:anchor-hrefs - get all of the anchors' hrefs, delimited by ||.
- html:parse-value
The following functions help scrape HTML elements into useful strings. They use the following namespace.
xmlns:html="http://www.w3.org/1999/xhtml"
We prefer to just produce a CSV from the HTML, instead of trying to model it in RDF directly. There are much nicer mechanisms in csv2rdf4lod to handle URI creation within the SDV paradigm. We write a row of CSV using the following.
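<!-- $DQ and $NL are assumed to hold a double quote and a newline; they are not defined on this page. -->
<xsl:value-of select="concat($DQ,string-join((
                         $perigee,$apogee,$inclination,$period,$semi-major-axis
                      ),
                      concat($DQ,',',$DQ)),$DQ,$NL)"/>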
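To make the quoting in that expression easier to follow, here is a minimal Python sketch of the same string assembly. The function names `csv_row` and `csv_row_escaped` and the sample values are hypothetical, chosen only for illustration. The first function mirrors the XPath expression literally; the second uses Python's `csv` module, because the literal version does not escape a double quote that appears inside a value.

```python
import csv
import io


def csv_row(values, dq='"', nl="\n"):
    """Mirror concat($DQ, string-join(values, '","'), $DQ, $NL): quote every field, no escaping."""
    return dq + (dq + "," + dq).join(values) + dq + nl


def csv_row_escaped(values):
    """Same layout via the csv module, which doubles any embedded double quote."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(values)
    return buf.getvalue()


# Hypothetical values standing in for $perigee, $apogee and $inclination.
print(csv_row(["35786", "35793", "0.1"]), end="")   # "35786","35793","0.1"
print(csv_row_escaped(['6" dish', "x"]), end="")    # "6"" dish","x"
```

If scraped cells can contain double quotes, the escaped variant is the safer model to follow.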
https://www.darpa.mil/OpenCatalog/index.html circa Feb 2014
<tr>
<td>Aptima Inc.</td>
<td>
<a href='https://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
Query by Example</a>
</td>
<td>Analytics</td>
<td>2014-07</td>
<td>https://github.com/Aptima/pattern-matching.git</td>
<td>
<a href='stats/pattern-matching/index.html'>stats</a>
</td>
<td>Hadoop MapReduce-over-Hive based implementation of network
query by example utilizing attributed network pattern
matching.</td>
<td>ALv2</td>
</tr>
https://hcil2.cs.umd.edu/newvarepository/benchmarks.php
html:text
Definition:
<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmltext -->
<xsl:function name="html:text">
   <xsl:param name="node"/>
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Usage:
<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1]),$NL)"/>
</xsl:template>
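As a rough illustration of what the one-argument html:text returns, the following Python sketch mirrors its logic with the standard library. It is an approximation under stated assumptions, not a port of the stylesheet: `normalize_space` and `html_text` are hypothetical names, the XHTML namespace is omitted for brevity, and the input is two cells abridged from the DARPA Open Catalog row shown earlier.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(s):
    """Approximate XPath normalize-space(): collapse runs of space, tab, CR, LF and trim."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def html_text(node):
    """Normalize each descendant text node, join them with no separator, normalize again."""
    together = "".join(normalize_space(t) for t in node.itertext())
    return normalize_space(together)


# Two cells abridged from the DARPA Open Catalog row (namespace omitted).
row = ET.fromstring("""<tr>
<td>Aptima Inc.</td>
<td>
<a href='stats/pattern-matching/index.html'>stats</a>
</td>
</tr>""")

cells = row.findall("td")
print(html_text(cells[0]))  # Aptima Inc.
print(html_text(cells[1]))  # stats
```

The whitespace-only text nodes around the anchor normalize to empty strings, so only the label survives.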
Adding a parameter for a delimiter:
<xsl:function name="html:text">
   <xsl:param name="node"/>
   <xsl:param name="delim"/>
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="concat(normalize-space(.),$delim)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
Usage:
<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1],' '),$NL)"/>
</xsl:template>
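The delimiter matters when a cell holds several text nodes. The Python sketch below approximates the two-argument html:text to show the difference. The names `normalize_space` and `html_text` and the sample cell with a `<br/>` are hypothetical and do not come from any of the scraped pages.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(s):
    """Approximate XPath normalize-space(): collapse runs of space, tab, CR, LF and trim."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def html_text(node, delim=""):
    """Append delim after every normalized text node, then normalize the joined string."""
    together = "".join(normalize_space(t) + delim for t in node.itertext())
    return normalize_space(together)


# Hypothetical cell whose label is split into two text nodes by a line break element.
cell = ET.fromstring("<td>Network<br/>Query by Example</td>")

print(html_text(cell))        # NetworkQuery by Example
print(html_text(cell, " "))   # Network Query by Example
print(html_text(cell, "|"))   # Network|Query by Example|
```

With an empty delimiter the two text nodes fuse. A space delimiter separates them, and the final normalization trims the trailing space. A non-whitespace delimiter is left in place after the last text node, because the final step only trims whitespace.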
Uses:
- July 2014 GRDDL svg (added the delimiter)
- May 27 2014 hcil-cs-umd-edu/visual-analytics-benchmark-repository (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
- Dec 1 19:06 2013 n2yo-com/browse/src/html2csv.xsl (same as shown)
html:anchor-labels
Definition:
<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmlanchor-labels -->
<xsl:function name="html:anchor-labels">
   <xsl:param name="anchors"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
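For readers who want to check the expected output, here is a Python approximation of html:anchor-labels applied to the two anchors of the DARPA Open Catalog row shown earlier. The names `normalize_space` and `anchor_labels` are hypothetical, and the XHTML namespace is omitted; treat it as a sketch of the behavior rather than a replacement for the stylesheet.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(s):
    """Approximate XPath normalize-space(): collapse runs of space, tab, CR, LF and trim."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def anchor_labels(anchors):
    """Join the normalized string value of each anchor with '||'."""
    labels = (normalize_space("".join(a.itertext())) for a in anchors)
    return normalize_space("||".join(labels))


# The two anchor cells from the DARPA Open Catalog row (namespace omitted).
row = ET.fromstring("""<tr>
<td>
<a href='https://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
Query by Example</a>
</td>
<td>
<a href='stats/pattern-matching/index.html'>stats</a>
</td>
</tr>""")

print(anchor_labels(row.iter("a")))  # Network Query by Example||stats
```

The line break inside the first label collapses to a single space, and `||` appears only between labels, matching the `position() gt 1` test.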
Uses:
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
html:anchor-hrefs
Definition:
<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmlanchor-hrefs -->
<xsl:function name="html:anchor-hrefs">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>
   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
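The following Python sketch approximates html:anchor-hrefs to show how `$base` is applied. The names `normalize_space` and `anchor_hrefs`, the base URL, and the second anchor in the sample cell are assumptions made for illustration; only the `stats` anchor comes from the DARPA row shown earlier.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(s):
    """Approximate XPath normalize-space(): collapse runs of space, tab, CR, LF and trim."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def anchor_hrefs(anchors, base):
    """Prefix each normalized href with base by plain concatenation, then join with '||'."""
    hrefs = (base + normalize_space(a.get("href", "")) for a in anchors)
    return normalize_space("||".join(hrefs))


# The 'stats' anchor is from the DARPA row; the second anchor is hypothetical.
cell = ET.fromstring(
    "<td><a href='stats/pattern-matching/index.html'>stats</a>"
    " <a href=' index.html '>home</a></td>"
)

# Assumed base: the directory of the page being scraped.
base = "https://www.darpa.mil/OpenCatalog/"
print(anchor_hrefs(cell.iter("a"), base))
# https://www.darpa.mil/OpenCatalog/stats/pattern-matching/index.html||https://www.darpa.mil/OpenCatalog/index.html
```

Because the base is concatenated rather than resolved, an href that is already absolute is prefixed as well. Passing an empty base, or resolving with `resolve-uri()` in XPath 2.0 (`urllib.parse.urljoin` in Python), avoids that.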
Uses:
- Aug 24 2014 freeformatter-com/mime-types-list/src/html2csv.xsl
- May 27 2014 hcil-cs-umd-edu/visual-analytics-benchmark-repository (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/pubs.xsl (same as shown)
- Feb 12 18:14 2014 darpa-mil/open-catalog/src/software.xsl (same as shown)
- Dec 5 09:26 2013 n2yo-com/satellites/src/html2csv.xsl (shown above)
- Dec 4 13:12 2013 n2yo-com/satellite-categories/src/category2csv.xsl (same as shown)
- Dec 3 16:45 2013 n2yo-com/satellite-categories/src/index2csv.xsl (same as shown)
Uses:
- n2yo-com/satellites/src/html2csv.xsl
Definition:
<xsl:function name="html:capitalize">
   <xsl:param name="string"/>
   <xsl:value-of select="concat(upper-case(substring($string,1,1)),
                                substring($string, 2))"/>
</xsl:function>
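A short Python equivalent, under the assumption that only the first character should change, makes the behavior of html:capitalize easy to check. The function name `capitalize` and the sample strings are illustrative.

```python
def capitalize(string):
    """Upper-case the first character and leave the rest of the string unchanged."""
    return string[:1].upper() + string[1:]


print(capitalize("semi-major axis"))  # Semi-major axis
print(capitalize("GRDDL svg"))        # GRDDL svg
print("GRDDL svg".capitalize())       # Grddl svg
```

The last line shows why Python's built-in `str.capitalize()` is not a drop-in match: it also lower-cases the remainder, which the XSL function does not do.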