# Getting Started
Install the antch package:

```
go get github.com/antchfx/antch
```
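The snippets on this page are assumed to live in a single `main.go`. An import block covering everything used here (plus `log`, which only the optional middleware sketch further down needs) might look like this; the exact set depends on which optional pieces you keep:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/antchfx/antch"
	"github.com/antchfx/htmlquery"
)
```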
Define a struct called `item` to hold the scraped data:

```go
type item struct {
	Title string `json:"title"`
	Link  string `json:"link"`
	Desc  string `json:"desc"`
}
```
Create a struct called `dmozSpider` that implements the `Handler` interface:
```go
type dmozSpider struct{}

func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {}
```
`dmozSpider` extracts data from each received page and passes it into the Pipeline:
```go
doc, err := antch.ParseHTML(res)
for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
	v := new(item)
	v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
	v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
	v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
	c <- v
}
```
The `htmlquery` package supports extracting data with XPath expressions; each extracted item is then sent to the Go channel `c` with `c <- v`.
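Putting the two snippets above together, the full `ServeSpider` method could look like the following sketch (the error check after `antch.ParseHTML` is an addition for safety, not part of the original snippet):

```go
func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {
	// Parse the HTTP response body into an HTML document tree.
	doc, err := antch.ParseHTML(res)
	if err != nil {
		return // skip pages that fail to parse
	}
	// Each matched <div> describes one site entry in the directory listing.
	for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
		v := new(item)
		v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
		v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
		v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
		c <- v // hand the item off to the pipeline
	}
}
```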
Create a new Pipeline called `jsonOutputPipeline` that implements the `PipelineHandler` interface. `jsonOutputPipeline` serializes each received Item to JSON and prints it to the console:
```go
type jsonOutputPipeline struct{}

func (p *jsonOutputPipeline) ServePipeline(v antch.Item) {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	os.Stdout.Write(b)
}
```
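As written, `ServePipeline` writes the JSON objects back to back with no separator. If you prefer one item per line, a small optional variation (not from the original page) using the standard library's `json.Encoder` would be:

```go
func (p *jsonOutputPipeline) ServePipeline(v antch.Item) {
	// json.Encoder appends a trailing newline after each value,
	// producing line-delimited JSON on stdout.
	if err := json.NewEncoder(os.Stdout).Encode(v); err != nil {
		panic(err)
	}
}
```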
Create a new web crawler instance:

```go
crawler := antch.NewCrawler()
```
You can enable middleware for HTTP cookies or robots.txt if you want.

- Enable the cookies middleware for the web crawler: `crawler.UseCookies()`
- You can even register custom middleware for the web crawler (a sketch follows this list): `crawler.UseMiddleware(CustomMiddleware())`
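`CustomMiddleware` is not defined on this page. As a rough sketch only, assuming antch follows a net/http-style middleware pattern (a `Middleware` of the form `func(HttpMessageHandler) HttpMessageHandler`, where `HttpMessageHandler` exposes `Send(*http.Request) (*http.Response, error)` and `HttpMessageHandlerFunc` is the function adapter; verify these names against the package documentation), a request-logging middleware could look like this:

```go
// CustomMiddleware is a hypothetical example that logs every outgoing
// request before handing it to the next handler in the chain.
// The antch.Middleware / antch.HttpMessageHandler(Func) names are
// assumed from the package docs, not taken from this page.
func CustomMiddleware() antch.Middleware {
	return func(next antch.HttpMessageHandler) antch.HttpMessageHandler {
		return antch.HttpMessageHandlerFunc(func(req *http.Request) (*http.Response, error) {
			log.Printf("fetching %s", req.URL)
			return next.Send(req)
		})
	}
}
```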
Register `dmozSpider` with the web crawler instance. `dmozSpider` will process every page whose URL matches the `dmoztools.net` pattern:
```go
crawler.Handle("dmoztools.net", &dmozSpider{})
```
Register `jsonOutputPipeline` with the web crawler instance:

```go
crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())
```
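The call above references two constructor functions, `newTrimSpacePipeline` and `newJsonOutputPipeline`, that are not defined on this page. Assuming antch's `Pipeline` type is a chaining function of the form `func(PipelineHandler) PipelineHandler` (mirroring the middleware pattern; verify against the package documentation), `newJsonOutputPipeline` might be written as:

```go
// newJsonOutputPipeline adapts the jsonOutputPipeline defined earlier into a
// pipeline constructor. It sits at the end of the chain, so the next handler
// is ignored here. This is a sketch under the assumption stated above.
func newJsonOutputPipeline() antch.Pipeline {
	return func(next antch.PipelineHandler) antch.PipelineHandler {
		return &jsonOutputPipeline{}
	}
}
```

A trim-space pipeline would have the same shape but keep a reference to `next` and forward each item after cleaning it up; if you only need the JSON output, registering `newJsonOutputPipeline()` by itself is enough.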
Finally, seed the crawler with its start URLs:

```go
startURLs := []string{
	"https://dmoztools.net/Computers/Programming/Languages/Python/Books/",
	"https://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
}
crawler.StartURLs(startURLs)
```
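For reference, the wiring steps above can be assembled into a single `main` function. This is just the page's snippets arranged in order; whether `StartURLs` blocks until the crawl finishes is not covered here, so check the package documentation if your program exits too early:

```go
func main() {
	crawler := antch.NewCrawler()

	// Optional middleware.
	crawler.UseCookies()

	// Register the spider and the data pipelines.
	crawler.Handle("dmoztools.net", &dmozSpider{})
	crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())

	// Seed the crawl.
	startURLs := []string{
		"https://dmoztools.net/Computers/Programming/Languages/Python/Books/",
		"https://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
	}
	crawler.StartURLs(startURLs)
}
```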
Run it:

```
go run main.go
```

Enjoy it.