Converts WARC files to static HTML, rewriting links to relative paths so that the output is suitable for browsing
offline or rehosting on a standard web server.
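As an illustration of what rewriting a link to a relative path means, here is a minimal sketch. It is not taken from the warc2html source; the class name `RelativeLinkSketch`, the method `relativeLink`, and the example file paths are all hypothetical. It assumes both the page and the resource it links to have already been assigned output paths under the same output directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not the tool's actual implementation.
class RelativeLinkSketch {
    // Returns the link to write into the page saved at fromFile
    // so that it points at the resource saved at toFile.
    static String relativeLink(String fromFile, String toFile) {
        Path base = Paths.get(fromFile).getParent();
        Path target = Paths.get(toFile);
        // A page at the top of the output directory has no parent to relativize against.
        Path relative = (base == null) ? target : base.relativize(target);
        // Links always use forward slashes, whatever the platform separator is.
        return relative.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // prints "../css/site.css"
        System.out.println(relativeLink("example.org/blog/post.html", "example.org/css/site.css"));
    }
}
```

Because the result depends only on the relative positions of the two files, the output directory can be moved or served from any base URL without breaking links.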
Limitations:
Links in JavaScript are not rewritten
Assumes there's only one snapshot of each URL in the input
Does not handle resource records (yet)
Usage
To convert a file named input.warc.gz to static HTML:
java -jar warc2html.jar -o output/ input.warc.gz
Alternatively, if you'd like to convert a subset of records, you can supply a list of records in CDX11 format and the
path or URL where the corresponding WARC files are stored:
Files are renamed to remove characters like "?" that are disallowed on some systems. File extensions are updated or added
based on the Content-Type header according to these rules.
URLs ending in / will be saved as index.html. Where two WARC records would produce the same filename, they are
disambiguated by adding a number like ~1, ~2, ~3 to the end of the filename.
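The naming rules above can be sketched roughly as follows. This is an illustrative approximation rather than the tool's actual code: the class `FilenameSketch`, the method `nameFor`, the exact set of replaced characters, and the choice of "_" as the replacement are assumptions. The sketch also appends the ~N suffix at the very end of the name, which is one literal reading of the rule, and it omits the Content-Type based extension handling.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the tool's actual implementation.
class FilenameSketch {
    // Tracks how many times each candidate filename has been produced so far.
    private final Map<String, Integer> seen = new HashMap<>();

    String nameFor(String urlPath) {
        // URLs ending in "/" are saved as index.html inside that directory.
        String name = urlPath.endsWith("/") ? urlPath + "index.html" : urlPath;
        // Assumed set of characters that are disallowed on some systems.
        name = name.replaceAll("[?*:<>|\"\\\\]", "_");
        Integer count = seen.get(name);
        if (count == null) {
            // First record with this name keeps it unchanged.
            seen.put(name, 0);
            return name;
        }
        // Later records with the same name get ~1, ~2, ~3, ...
        seen.put(name, count + 1);
        return name + "~" + (count + 1);
    }

    public static void main(String[] args) {
        FilenameSketch names = new FilenameSketch();
        System.out.println(names.nameFor("example.org/"));           // example.org/index.html
        System.out.println(names.nameFor("example.org/search?q=a")); // example.org/search_q=a
        System.out.println(names.nameFor("example.org/search?q=a")); // example.org/search_q=a~1
    }
}
```

Since the suffix depends on the order in which records are processed, the same input processed in the same order yields the same filenames.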
License
Copyright 2021 National Library of Australia
License: Apache 2.0