Uses sitemap.xml to seed the initial crawl of the site
Built around a parallel task async/await system
Swappable request and content processors, allowing greater customisation
Auto-throttling (see below)
Licensing and Support
Infinity Crawler is licensed under the MIT license. It is free to use in personal and commercial projects.
There are support plans available that cover all active Turner Software OSS projects.
Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more.
These support plans help fund our OSS commitments to provide better software for everyone.
Polite Crawling
The crawler is built around fast but "polite" crawling of websites.
This is accomplished through a number of settings that allow adjustments of delays and throttles.
You can control:
Number of simultaneous requests
The delay between requests starting (Note: If a crawl-delay is defined for the User-agent, that will be the minimum)
Artificial "jitter" in request delays (so requests seem less "robotic")
Timeout for a request before throttling will apply for new requests
Throttling request backoff: The amount of time added to the delay to throttle requests (this is cumulative)
Minimum number of requests under the throttle timeout before the throttle is gradually removed
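
How these settings might combine can be sketched in a few lines of Python. This is an illustrative model only, not Infinity Crawler's implementation or API: the class names, field names, and default values are all hypothetical, and the rule for "gradually" removing the throttle (one backoff step per fast request once the minimum streak is reached) is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class ThrottleSettings:
    # Hypothetical names and defaults, chosen only for this sketch.
    delay_between_requests: float = 1.0  # seconds between request starts
    delay_jitter: float = 0.5            # maximum random extra delay, in seconds
    throttle_timeout: float = 2.0        # response time above which throttling applies
    throttle_backoff: float = 0.5        # seconds added to the delay per slow request
    min_fast_requests: int = 3           # fast requests needed before easing the throttle


class PoliteDelay:
    """Tracks the cumulative throttle and computes the delay before the next request."""

    def __init__(self, settings, crawl_delay=0.0, rng=None):
        self.settings = settings
        self.crawl_delay = crawl_delay   # crawl-delay for the User-agent, if any
        self.rng = rng or random.Random()
        self.throttle = 0.0              # cumulative extra delay, in seconds
        self.fast_streak = 0             # consecutive requests under the timeout

    def record_response(self, elapsed):
        s = self.settings
        if elapsed > s.throttle_timeout:
            # Slow response: add another backoff step (cumulative) and reset the streak.
            self.throttle += s.throttle_backoff
            self.fast_streak = 0
        else:
            self.fast_streak += 1
            if self.fast_streak >= s.min_fast_requests and self.throttle > 0:
                # Assumed easing rule: remove one backoff step per fast request.
                self.throttle = max(0.0, self.throttle - s.throttle_backoff)

    def next_delay(self):
        s = self.settings
        # The crawl-delay acts as a floor on the configured delay.
        base = max(s.delay_between_requests, self.crawl_delay)
        jitter = self.rng.uniform(0.0, s.delay_jitter)
        return base + jitter + self.throttle
```

A crawler loop would call `record_response` with each request's elapsed time and wait `next_delay()` seconds before starting the next request.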
Other Settings
Control the UserAgent used in the crawling process
Set additional host aliases you want the crawling process to follow (for example, subdomains)
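
To illustrate what host aliases mean in practice, the following Python sketch shows one way a crawler could decide whether a discovered link is in scope. It is not the library's code: the function name `is_in_scope` and its parameters are hypothetical, and exact, case-insensitive host matching is an assumption.

```python
from urllib.parse import urlsplit


def is_in_scope(url, base_url, host_aliases=()):
    """Return True if url's host is the crawl's base host or one of its aliases."""
    host = (urlsplit(url).hostname or "").lower()
    allowed = {(urlsplit(base_url).hostname or "").lower()}
    allowed.update(alias.lower() for alias in host_aliases)
    # URLs without a host (for example, mailto: links) are never in scope.
    return bool(host) and host in allowed
```

With `base_url` set to `https://example.org/`, a link to `https://cdn.example.org/a.png` is followed only when `cdn.example.org` is listed as an alias.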