Roadmap · ArchiveBox/ArchiveBox Wiki · GitHub
Roadmap
Nick Sweeting edited this page May 8, 2024 · 88 revisions
Official Roadmap Discussion.
(this is not set in stone, just a rough estimate)
- move config loading logic into settings.py
- move all the extractors into "plugin" style folders that register their own config
- right now, the paths of the extractor output are scattered all over the codebase, e.g. `output.pdf` (should be moved to constants at the top of the plugin config file)
- make out_dir, link_dir, extractor_dir, naming consistent across codebase
- remove `timestamps` as primary keys in favor of hashes, UUIDs, or some other slug https://github.com/ArchiveBox/ArchiveBox/issues/74
- create a migration system for folder layout independent of the index (`mv` is atomic at the FS level, so we just need a `transaction.atomic(): move(oldpath, newpath); snap.data_dir = newpath; snap.save()`)
- make `Tag` a real model `ManyToMany` with Snapshots
- allow multiple Snapshots of the same site over time + CLI / UI to manage those, + migration from old style `#2020-01-01` hack to proper versioned snapshots
- upgrade from Django 3 to Django 5 https://github.com/ArchiveBox/ArchiveBox/issues/988
- Add CSRF/CSP/XSS protection to rendered archive pages
- Provide secure reverse proxy in front of archivebox server in docker-compose.yml
- Create UX flow for users to set up session cookies / auth for archiving private sites
- cookies for wget, curl, etc low-level commands
- localStorage, cookies, IndexedDB setup for chrome archiving methods
- set up huey, break up archiving process into tasks on a queue that a worker pool executes
- set up pyppeteer2 to wrap chrome so that it's not opened/closed during each extractor
- run user-scripts / extensions in the context of the page during archiving
- community userscripts for unrolling twitter threads, reddit threads, youtube comment sections, etc.
- pywb-based headless browser session recording and warc replay
- archive proxy support
- support sending upstream requests through an external proxy
- support for exposing a proxy that archives all downstream traffic
...
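The folder-layout migration idea above (move the directory, then update the index row, all inside one transaction) could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the stdlib `sqlite3` module in place of Django's `transaction.atomic()`, and the `snapshot` table, its `data_dir` column, and the `move_snapshot_dir` helper are hypothetical names, not ArchiveBox's actual schema or API.

```python
import sqlite3
import tempfile
from pathlib import Path


def move_snapshot_dir(db: sqlite3.Connection, snapshot_id: int, new_path: Path) -> None:
    """Rename a snapshot folder and record the new location, or leave both untouched."""
    row = db.execute(
        "SELECT data_dir FROM snapshot WHERE id = ?", (snapshot_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no snapshot with id {snapshot_id}")
    old_path = Path(row[0])
    new_path.parent.mkdir(parents=True, exist_ok=True)

    moved = False
    try:
        with db:  # commits on success, rolls back the UPDATE on any exception
            old_path.rename(new_path)  # atomic on POSIX when both paths share a filesystem
            moved = True
            db.execute(
                "UPDATE snapshot SET data_dir = ? WHERE id = ?",
                (str(new_path), snapshot_id),
            )
    except Exception:
        if moved:
            new_path.rename(old_path)  # undo the move so disk and index still agree
        raise


# Tiny demo in a throwaway directory with an in-memory index (all names hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "archive" / "1577836800").mkdir(parents=True)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, data_dir TEXT)")
    db.execute(
        "INSERT INTO snapshot VALUES (1, ?)", (str(root / "archive" / "1577836800"),)
    )
    db.commit()
    move_snapshot_dir(db, 1, root / "snapshots" / "abc123")
    print(db.execute("SELECT data_dir FROM snapshot WHERE id = 1").fetchone()[0])
    db.close()
```

The rename is done inside the transaction block so that a failed `UPDATE` can be paired with moving the folder back. A crash between the rename and the commit would still leave the two out of sync, so a real migration would probably also want a repair pass on startup.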
- ZFS / Merkle tree for storing archive output subresource hashes
- DHT for assigning Merkle tree hash:file shards to nodes
- tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype etc.
- distributed tag lookup system
- ✅ release `pip`, `apt`, `pkg`, and `brew` packaged distributions for installing ArchiveBox
- ✅ add an optional web GUI for managing sources, adding new links, and viewing the archive
- ✅ switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
- modularize internals to allow importing individual components
- switch to sha256 of URL as unique link ID
- support storing multiple snapshots of pages over time
- support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
- support named collections of archived content with different user access permissions
- support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
- support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
- ✅ body text extraction to markdown (using ~~fathom~~ readability and mercury)
- featured image / thumbnail extraction
- auto-tagging links based on important/frequent keywords in extracted text (like pocket)
- automatic article summary paragraphs from extracted text with nlp summarization library
- ✅ full-text search of extracted text with ~~elasticsearch/elasticlunr/ag~~ sonic and ripgrep
- ✅ download closed-caption subtitles from Youtube and other video sites (TODO: submit the subtitle files to the full-text search index)
- try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
- And more in the issues list...
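For the "sha256 of URL as unique link ID" item in the list above, a minimal sketch might look like the following. The function names and the 12-character short form are hypothetical choices for illustration. The sketch also hashes the URL string exactly as given, so whether URLs should be normalized first is left open.

```python
import hashlib


def link_id(url: str) -> str:
    """Return a stable hex ID for a URL: the SHA-256 digest of its UTF-8 bytes."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def short_link_id(url: str, length: int = 12) -> str:
    """Hypothetical shortened form for folder names; shorter prefixes collide more easily."""
    return link_id(url)[:length]


print(link_id("https://example.com/"))
print(short_link_id("https://example.com/"))
```

Because the raw string is hashed, `https://example.com` and `https://example.com/` get different IDs. Unlike a timestamp key, though, the ID does not depend on when the link was added.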
IMPORTANT: Please don't work on any of these major long-term tasks without contacting me first, work is already in progress for many of these, and I may have to reject your PR if it doesn't align with the existing work!
To see how this spec has been scheduled / implemented / released so far, read these pull requests:
- ✅ v0.1.x pre-git-history (~2017)
- ✅ v0.2.x (~2018/12)
- ✅ v0.3.x (~2019/03)
- ✅ v0.4.x (~2019/04)
- ✅ v0.5.x (~2020/11)
- ✅ v0.6.x (~2021/03)
- 🏖️ sabbatical / coding hiatus during 2022
- ✅ v0.7.x (~2023/11)
- 🛠 v0.8.x (~2024/05)
- 📅 v0.9.x up next...
- https://github.com/ArchiveBox/ArchiveBox/issues/1358
- https://github.com/ArchiveBox/ArchiveBox/issues/1273
- https://github.com/ArchiveBox/ArchiveBox/issues/988
- https://github.com/ArchiveBox/ArchiveBox/issues/930
- `gallery-dl`: https://github.com/ArchiveBox/ArchiveBox/issues/564
- `forum-dl`: https://github.com/ArchiveBox/ArchiveBox/issues/1368
- `scihub-dl`: https://github.com/ArchiveBox/ArchiveBox/issues/720
- `cad-dl`: https://github.com/ArchiveBox/ArchiveBox/issues/668
- `aria2`: https://github.com/ArchiveBox/ArchiveBox/issues/1355
- `podcast-archiver`: https://github.com/ArchiveBox/ArchiveBox/issues/1357
- `bdfr`: https://github.com/ArchiveBox/ArchiveBox/issues/778
- `cutycapt` screenshots: https://github.com/ArchiveBox/ArchiveBox/issues/253
- sourcemap downloader: https://github.com/ArchiveBox/ArchiveBox/issues/1291
ArchiveBox Developer Documentation: Contributing a New Extractor
And others we're considering for the future:
- Instagram
- https://github.com/instaloader/instaloader (instagram downloader)
- https://github.com/althonos/InstaLooter (stale)
- Telegram
- https://github.com/iyear/tdl (telegram downloader)
- TikTok
- https://github.com/charmparticle/tiktokget (tiktok downloader using yt-dlp)
- https://github.com/TerminalWarlord/TikTok-Downloader-Bot
- https://github.com/n0l3r/tiktok-downloader
- https://github.com/hansputera/tiktok-dl
- https://github.com/naseif/tiktok-scraper
- https://github.com/irevenko/tiktik
- https://github.com/samirelanduk/tiktok-save
- https://github.com/Dinoosauro/tiktok-to-ytdlp
- https://github.com/krypton-byte/tiktok-downloader
- Twitter
- https://github.com/HoloArchivists/twspace-dl (stale, twitter spaces archiver)
- https://github.com/soimort/you-get ⭐️
- https://github.com/lay295/TwitchDownloader
- https://github.com/ihabunek/twitch-dl
- https://github.com/iawia002/lux (generic video/audio downloader)
- https://github.com/wukko/cobalt (generic video/audio downloader)
- https://github.com/jaysonlong/webvideo-downloader (Bilibili, iQIYI, Tencent Video, MGTV and WeTV)
- https://github.com/spaam/svtplay-dl (comedy central, twitch, HBO, etc. video downloader)
- https://github.com/aajanki/yle-dl (Yle Areena Finnish broadcasting video downloader)
- https://github.com/WHTJEON/widevine-dl (encrypted widevine video downloader)
- https://github.com/nathom/streamrip (Qobuz, Tidal, Deezer and SoundCloud)
- https://github.com/0xHJK/music-dl
- https://github.com/guanguans/music-dl
- https://github.com/CharlesPikachu/musicdl
- https://github.com/iheanyi/bandcamp-dl
- https://github.com/spotDL/spotify-downloader
- https://github.com/Shabinder/SpotiFlyer
- https://github.com/SathyaBhat/spotify-dl / https://github.com/SwapnilSoni1999/spotify-dl / https://github.com/dhruv-ahuja/spoti-dl
- https://github.com/vitiko98/qobuz-dl (Qobuz music downloader)
- https://github.com/akhilrex/podgrab (stale)
- https://github.com/yaronzz/Tidal-Media-Downloader-PRO (stale)
- https://github.com/flyingrub/scdl (stale)
- https://github.com/ravishi/rdio-dl (stale, Rdio song downloader)
- https://github.com/carlosflorencio/laracasts-downloader (stale?)
- https://github.com/mikf/gallery-dl ⭐️
- https://github.com/Bionus/imgbrd-grabber (generic image board downloader like gallery-dl)
- https://github.com/Xonshiz/comic-dl (comic, anime, manga, etc. downloader)
- https://github.com/justfoolingaround/animdl (anime downloader)
- https://github.com/metafates/mangal (manga downloader)
- https://github.com/boredazfcuk/docker-icloudpd (iCloud Photos downloader)
- https://github.com/Oshan96/monkey-dl (stale? anime downloader)
- https://github.com/QianyanTech/Image-Downloader (stale?)
- https://github.com/Xonshiz/anime-dl (stale?)
- https://github.com/mikwielgus/forum-dl ⭐️
- https://github.com/AndyTheFactory/newspaper4k ⭐️
- https://github.com/AAndyProgram/SCrawler (Twitter, Reddit, Instagram, Threads, Facebook, Pinterest, nsfw sites downloader)
- https://github.com/extractus/article-extractor
- https://github.com/shadowmoose/RedditDownloader (stale?)
- https://github.com/aliparlakci/bulk-downloader-for-reddit (stale?)
- https://github.com/coursera-dl/coursera-dl
- https://github.com/rand-net/khan-dl
- https://github.com/C0D3D3V/Moodle-DL
- https://github.com/r0oth3x49/acloud-dl
- https://github.com/Puyodead1/udemy-downloader
- https://github.com/PyJun/Mooc_Downloader (stale)
- https://github.com/yann0917/dedao-dl (stale, MOOC course downloader)
- https://github.com/coursera-dl/edx-dl (stale?)
- https://github.com/SigureMo/mooc-dl (stale?)
- https://github.com/calvinhobbes23/Skillshare-DL (stale)
- https://github.com/r0oth3x49/lynda-dl (stale, Lynda.com course downloader)
- https://github.com/hartator/wayback-machine-downloader
- https://github.com/MiniGlome/Archive.org-Downloader
- https://github.com/ArchiveTeam/grab-site
- https://github.com/oduwsdl/archivenow
- https://github.com/wabarc/warcraft
- https://github.com/sul-dlss/wasapi-downloader
- https://github.com/KellyStathis/warc_downloader
- https://github.com/internetarchive/heritrix3
- https://github.com/AhmadIbrahiim/Website-downloader (wget wrapper)
- https://github.com/igrigorik/gharchive.org (stale? Github downloader)
- https://github.com/KurtBestor/Hitomi-Downloader
- https://github.com/nilaoda/BBDown
- https://github.com/biliup/biliup
- https://github.com/yutto-dev/bilili
- https://github.com/nICEnnnnnnnLee/BilibiliDown
- https://github.com/matlink/gplaycli (Google Play store Android app downloader)
- https://github.com/AlphaSlayer1964/kemono-dl (Patreon, gumroad, etc. archiver)
- https://github.com/manga-download/hakuneko
- https://github.com/cancerian0684/dli-downloader (Digital Library of India ebook downloader)
- https://github.com/tusharbabbar/gaana-dl (gaana.com bollywood song downloader)
- https://github.com/rebane2001/matterport-dl (stale? virtual house tour downloader)