Home · dotnetcore/DotnetSpider Wiki · GitHub
Home
邹嵩 edited this page Apr 4, 2020
·
13 revisions
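DotnetSpider is a lightweight, flexible, high-performance, cross-platform distributed web crawler framework that helps .NET engineers build crawlers quickly.
As the design diagram above shows, the crawler is fully asynchronous and uses a message queue to decouple its components. A single-machine crawler needs no extra configuration, because an in-memory message queue is used by default. To build a fully distributed crawler, you only need to introduce a message queue; how to build a distributed crawler is covered in detail later.
The basic flow of a crawler is: download data (send an HTTP request and receive the response) -> parse the returned text (which may be text, JSON, or HTML) -> store the parsed data. These three main steps can be broken down further into the following modules.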
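The decoupling described above can be illustrated with a minimal, language-neutral sketch. It is written in Python rather than C# and is not DotnetSpider's API: the names `scheduler`, `download_agent`, `run`, and `crawl` are hypothetical, and the "download" is a stand-in string rather than a real HTTP request. The point is only that the producer and the consumer share nothing but the queue.

```python
import asyncio


async def scheduler(queue, urls):
    # Producer: publishes requests without knowing who consumes them.
    for url in urls:
        await queue.put(url)
    await queue.put(None)  # sentinel: no more requests


async def download_agent(queue, results):
    # Consumer: depends only on the queue, never on the scheduler.
    while True:
        url = await queue.get()
        if url is None:
            break
        # Stand-in for a real HTTP fetch (assumption for illustration).
        results.append(f"downloaded {url}")


async def run(urls):
    # In-memory queue; a distributed setup would swap in an external broker.
    queue = asyncio.Queue()
    results = []
    await asyncio.gather(scheduler(queue, urls), download_agent(queue, results))
    return results


def crawl(urls):
    # Synchronous entry point that drives the event loop.
    return asyncio.run(run(urls))
```

Because both sides talk only to the queue, replacing the in-memory queue with a networked one would, in a design like this, leave the producer and consumer logic unchanged.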
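- Scheduler: deduplicates crawl requests and controls crawl order. Breadth-first and depth-first schedulers are implemented by default. A scheduler can use different hash-based duplicate removers; the default HashSetDuplicateRemover is usually enough, and BloomFilterDuplicateRemover suits very large crawls. To schedule massive numbers of requests or to resume a crawl after a restart, you need to implement your own scheduler backed by a database (a relational database, Redis, and so on).
- Download agent: download agents can be deployed on different machines; for a single-machine crawler, each crawler instance starts its own download agent. A download agent receives the requests to be downloaded and handles each one with the matching downloader (HttpClient, Puppeteer, or a custom downloader).
- Download agent registration service: this service only receives registrations and heartbeats from download agents, and the crawler still works if it is not enabled. A single-machine crawler starts an in-memory registration service by default.
- Statistics service: tracks the running state of each crawler and download agent, such as a crawler's total and successful request counts, and a download agent's total successful requests and total elapsed time.
- Request supplier interface: in many scenarios the download requests are known in advance or stored somewhere (a file or a database, for example), and a request supplier provides them to the crawler.
- Request configuration (Spider.ConfigureRequest): requests can usually be built automatically, but special cases such as adding a sign can be handled in one place here.
- DataFlow: there are two kinds of data flow, parsers and storages. In the extreme case where you do not want that much structure, you can implement both parsing and storage yourself in a single DataFlow. A crawler can have multiple DataFlows, executed in the order they were added; an exception thrown in any DataFlow aborts the whole processing pipeline.
- Proxy pool: each crawler instance starts a proxy background service that periodically fetches new proxies from the registered IProxySupplier. Each new proxy must pass a check before it enters the proxy pool. The test address, ProxyTestUri, can be set in the configuration file or when creating the Builder.
- Concurrency controller: pulls requests from the Scheduler at a fixed rate and pushes them to the message queue. These requests are cached in RequestedQueue, which is implemented with a low-overhead HashedWheelTimer. If no message comes back from the download agent within a set time, the request is treated as a timeout and retried until the retry limit is exceeded.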
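To make the Scheduler item concrete, here is a minimal sketch of a scheduler that combines set-based deduplication with a breadth-first or depth-first dequeue order. It is written in Python for illustration and is not DotnetSpider's implementation: the class `QueueScheduler` and its methods are hypothetical names, and the `set` only plays the role that a hash-set duplicate remover would play.

```python
from collections import deque


class QueueScheduler:
    """Hypothetical scheduler: dedup on enqueue, BFS or DFS on dequeue."""

    def __init__(self, depth_first=False):
        self._pending = deque()
        self._seen = set()  # stands in for a hash-set duplicate remover
        self._depth_first = depth_first

    def enqueue(self, url):
        # Reject requests that were already scheduled once.
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def dequeue(self):
        if not self._pending:
            return None
        # Depth-first takes the newest request; breadth-first the oldest.
        if self._depth_first:
            return self._pending.pop()
        return self._pending.popleft()
```

Everything here lives in memory, so a crawl cannot resume after a restart. That limitation is why the text recommends a database-backed scheduler for very large or resumable crawls.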