A dockerized, queued, high-fidelity web archiver based on Squidwarc (headless Chrome), RabbitMQ and a small web front end. Using the scripting abilities of Squidwarc, you can add scripts that should be run for a specific job (e.g. srcset enrichment, comment expansion). Please note that Warcworker is not a crawler: it will not crawl a website automatically, so you have to use other software to build lists of URLs to send to Warcworker.
Installation
Copy .env_example to .env. Update information in .env.
Start with docker-compose up -d --scale worker=3 (wait a minute for everything to start up)
Archiving and playback
Open the web front end at https://0.0.0.0:5555 to enter URLs for archiving. You can prefill the text fields with the url and description request parameters. Play back the resulting WARC files with Webrecorder Player.
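A prefilled link can also be generated programmatically. The following is a minimal sketch, assuming the front end reads the url and description parameters from the query string of its root page; the root path and the helper name prefill_link are assumptions for illustration and are not part of Warcworker.

```python
from urllib.parse import urlencode

# Base address of the web front end, as given above.
FRONTEND = "https://0.0.0.0:5555/"


def prefill_link(url, description=""):
    """Return a front-end link with the form fields prefilled (hypothetical helper)."""
    params = {"url": url}
    if description:
        # Only include the description when one is given.
        params["description"] = description
    # urlencode percent-escapes the values so they survive as query parameters.
    return FRONTEND + "?" + urlencode(params)


print(prefill_link("https://www.peterkrantz.com/", "Personal blog"))
```

A link built this way could serve as the target of a bookmark or be generated from an existing list of URLs.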
Using
Bookmarklet
Add a bookmarklet to your browser with the following link:
Now you have two-click web archiving from your browser.
Command line
To use from the command line with curl:
curl -d "scripts=srcset&scripts=scroll_everything&url=https://www.peterkrantz.com/" -X POST https://0.0.0.0:5555/process/
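The same request can be sent from Python, which is convenient when another tool has already produced a list of URLs. This is a sketch using only the standard library, assuming the /process/ endpoint accepts the same form-encoded fields as the curl call above; the helper names build_form_body and submit are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl example above.
ENDPOINT = "https://0.0.0.0:5555/process/"


def build_form_body(url, scripts):
    """Encode the form fields like curl -d: one "scripts" field per script, then "url"."""
    fields = [("scripts", name) for name in scripts] + [("url", url)]
    return urlencode(fields).encode("ascii")


def submit(url, scripts, timeout=120):
    """POST one URL to the queue and return the HTTP status code."""
    req = Request(ENDPOINT, data=build_form_body(url, scripts), method="POST")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urlopen(req, timeout=timeout) as resp:
        return resp.status


# Show the body that would be sent, without contacting the server.
print(build_form_body("https://www.peterkrantz.com/", ["srcset", "scroll_everything"]).decode())
```

Calling submit(url, scripts) for each entry in a URL list queues the pages one by one; the function is not invoked here so the block runs without a live Warcworker instance.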
Archivenow handler
To use from archivenow add a handler file handlers/ww_handler.py like this:
import requests
import json


class WW_handler(object):

    def __init__(self):
        self.enabled = True
        self.name = 'Warcworker'
        self.api_required = False

    def push(self, uri_org, p_args=[]):
        msg = ''
        try:
            # add scripts in the order you want them to be run on the page
            payload = {"url": uri_org, "scripts": ["scroll_everything", "srcset"]}
            r = requests.post('https://0.0.0.0:5555/process/', timeout=120,
                              data=payload,
                              allow_redirects=True)
            r.raise_for_status()
            return "%s added to queue" % uri_org
        except Exception as e:
            msg = "Error (" + self.name + "): " + str(e)
        return msg