You can start multiple spider instances that share a single Redis queue.
Best suited for broad multi-domain crawls.
Distributed post-processing
Scraped items get pushed into a Redis queue, meaning that you can start as
many post-processing processes as needed, all sharing the items queue.
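As a rough illustration of such a post-processing worker, the sketch below pops JSON-encoded items from a Redis list and hands each one to a callback. The function name `drain_items` and the key `myspider:items` are assumptions made for this example; check the `REDIS_ITEMS_KEY` setting of your project for the actual key.

```python
import json


def drain_items(client, key, handle, max_items=None):
    """Pop JSON-encoded items from a Redis list and pass each to `handle`.

    `client` is any object with an `lpop(key)` method, such as `redis.Redis`.
    Returns the number of items processed.
    """
    processed = 0
    while max_items is None or processed < max_items:
        raw = client.lpop(key)
        if raw is None:  # the list is empty, nothing left to process
            break
        handle(json.loads(raw))
        processed += 1
    return processed


# Hypothetical usage with a real server (requires redis-py and a running Redis):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379/0")
#   drain_items(client, "myspider:items", print)
```

Because each `lpop` removes the item atomically, several workers can run this loop against the same key without processing an item twice.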
Scrapy plug-and-play components
Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
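A minimal sketch of how these components are typically enabled in a project's `settings.py`. The class paths follow upstream scrapy-redis and may differ in this fork; the Redis URL is an assumption for a local server.

```python
# settings.py (sketch)

# Use the Redis-backed scheduler so all spider instances share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items into a Redis list for post-processing.
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}

# Assumption: a local Redis server on the default port.
REDIS_URL = "redis://localhost:6379/0"
```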
In this forked version: added support for JSON data in Redis
The data contains a url, `meta` and other optional parameters. `meta` is nested JSON that contains sub-data.
The function extracts this data and sends a FormRequest with the url, meta and additional formdata.
This data can be accessed in the Scrapy spider through the response's request,
like: request.url, request.meta, request.cookies
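To make the JSON layout concrete, here is a sketch of how such a payload could be built and split back into its parts. The helper names `build_payload` and `split_payload` are hypothetical, and treating every key other than `url` and `meta` as formdata is an assumption about the format, not the fork's actual implementation.

```python
import json


def build_payload(url, meta=None, **formdata):
    """Serialize a crawl job as the JSON string that would be pushed to Redis."""
    payload = {"url": url, "meta": meta or {}}
    payload.update(formdata)
    return json.dumps(payload)


def split_payload(raw):
    """Split a JSON payload into (url, meta, formdata)."""
    data = json.loads(raw)
    url = data.pop("url")
    meta = data.pop("meta", {})
    # FormRequest expects string values, so stringify whatever remains.
    formdata = {key: str(value) for key, value in data.items()}
    return url, meta, formdata
```

A producer would push the string returned by `build_payload` to the spider's Redis key, and the three values returned by `split_payload` map onto `scrapy.FormRequest(url, meta=meta, formdata=formdata)`.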
Note
These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
Requirements
Python 3.7+
Redis >= 5.0
Scrapy >= 2.0
redis-py >= 4.0
Installation
From pip
pip install scrapy-redis
From GitHub
git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install
Note
To use the JSON data feature, make sure you have not installed scrapy-redis through pip. If you already have, uninstall it first:
pip uninstall scrapy-redis
Alternative Choice
Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.