Hi, I'm trying to run archivebox/archivebox:dev in Docker Compose, and I would like to set the PUID and PGID environment variables to 33 (because I eventually want to store the archives in a folder owned by www-data on an SMB mount). Even with the data folder mounted as a named Docker volume (so no SMB weirdness involved yet), multiple extractors fail. If I don't set PUID and PGID, all extractors work. How can I fix this?
$ > docker compose run --rm archivebox add https://example.com/
[i] [2024-02-28 02:09:19] ArchiveBox v0.7.3: archivebox add https://example.com/
> /data
[+] [2024-02-28 02:09:21] Adding 1 links to index (crawl depth=0)...
> Saved verbatim input to sources/1709086161-import.txt
> Parsed 1 URLs from input (Generic TXT)
> Found 1 new URLs not already in index
[*] [2024-02-28 02:09:21] Writing 1 links to main index...
√ ./index.sqlite3
[*] [2024-02-28 02:09:21] Archiving 1/1 URLs from added set...
[▶] [2024-02-28 02:09:21] Starting archiving of 1 snapshots in index...
[+] [2024-02-28 02:09:21] "example.com"
https://example.com/
> ./archive/1709086161.608541
> favicon
> headers
> singlefile
Extractor failed:
SingleFile was not able to archive the page
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser "--browser-args=[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'\", \"--window-size=1440,2000\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/)\"]" "https://example.com/" singlefile.html
> pdf
Extractor failed:
Failed to save PDF
chrome_crashpad_handler: --database is required
Try 'chrome_crashpad_handler --help' for more information.
[143:143:0228/020923.342009:ERROR:socket.cc(120)] recvmsg: Connection reset by peer (104)
[143:143:0228/020923.342267:FATAL:crashpad_linux.cc(195)] Check failed: client.StartHandler(handler_path, *database_path, metrics_path, url, annotations, arguments, false, false).
#0 0x561630a9f932 base::debug::CollectStackTrace()
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
/usr/bin/chromium-browser --headless=new --no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run --use-fake-ui-for-media-stream --use-fake-device-for-media-stream --disable-sync "--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/)" --print-to-pdf "https://example.com/"
> screenshot
Extractor failed:
Failed to save screenshot
chrome_crashpad_handler: --database is required
Try 'chrome_crashpad_handler --help' for more information.
[147:147:0228/020923.873534:FATAL:crashpad_linux.cc(195)] Check failed: client.StartHandler(handler_path, *database_path, metrics_path, url, annotations, arguments, false, false).
#0 0x5650f477f932 base::debug::CollectStackTrace()
#1 0x5650f476d573 base::debug::StackTrace::StackTrace()
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
/usr/bin/chromium-browser --headless=new --no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run --use-fake-ui-for-media-stream --use-fake-device-for-media-stream --disable-sync "--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/)" --screenshot "https://example.com/"
> dom
Extractor failed:
Failed to save DOM
chrome_crashpad_handler: --database is required
Try 'chrome_crashpad_handler --help' for more information.
[151:151:0228/020924.330280:FATAL:crashpad_linux.cc(195)] Check failed: client.StartHandler(handler_path, *database_path, metrics_path, url, annotations, arguments, false, false).
#0 0x555ab663d932 base::debug::CollectStackTrace()
#1 0x555ab662b573 base::debug::StackTrace::StackTrace()
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
/usr/bin/chromium-browser --headless=new --no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run --use-fake-ui-for-media-stream --use-fake-device-for-media-stream --disable-sync "--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/)" --dump-dom "https://example.com/"
> wget
> title
Extractor failed:
Unable to detect page title
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
curl --silent --location --compressed --max-time 60 --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.5.0 (x86_64-pc-linux-gnu)" "https://example.com/"
> readability
Extractor failed:
Readability could not find HTML to parse for article text
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
curl ./{dom,singlefile}.html
> mercury
> htmltotext
Extractor failed:
htmltotext could not find HTML to parse for article text
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1709086161.608541;
"(internal) archivebox.extractors.htmltotext" ./{singlefile,dom}.html
> media
11 files (18.1 MB) in 0:00:08s
[√] [2024-02-28 02:09:30] Update of 1 pages complete (8.67 sec)
- 0 links skipped
- 1 links updated
- 1 links had errors
Hint: To manage your archive in a Web UI, run:
archivebox server 0.0.0.0:8000
In general, www-data is a very restricted user, so I'm not surprised it's breaking things. UIDs below 100 are a special case: on most Linux systems they are reserved for system accounts that often come with pre-existing permission restrictions, and setting ArchiveBox to use such a UID means inheriting whatever restrictions that UID already has.
I think uid=33 maps to an existing unprivileged www-data user inside the Docker image, so everything is failing because that user doesn't have permission to write to the system directories (/var, etc.) that it needs during archiving.
You could try leaving PUID at its default and setting only PGID=33 (or vice versa if that fails).
You can also experiment with creating a new group outside of Docker, adding www-data to that group, and then passing that group's ID in as PGID. This lets the default archivebox user (uid=911) run the archiving processes inside Docker (which should fix all the errors you're seeing), while keeping the data dir owned by a group ID that gives your host www-data user access outside of Docker.
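For example, the group approach could look roughly like this on the host. This is an untested sketch, not an official recipe: the group name `abxshare` and the three helper function names are made up for illustration, and it assumes a Linux host with `groupadd` and `usermod` available and a compose service named `archivebox`, as in the command from the question.

```shell
#!/bin/sh
# Hypothetical group name; pick whatever fits your host.
SHARED_GROUP="${SHARED_GROUP:-abxshare}"

# Print the numeric GID of a group, read from a group file
# (defaults to /etc/group). Prints nothing if the group is missing.
gid_of_group() {
    awk -F: -v grp="$1" '$1 == grp { print $3; exit }' "${2:-/etc/group}"
}

# One-time host setup (needs root): create the group if it does not
# exist yet and add the host's www-data user to it.
setup_shared_group() {
    groupadd -f "$SHARED_GROUP"
    usermod -aG "$SHARED_GROUP" www-data
}

# Run `archivebox add` with PUID left at its default and PGID set to
# the shared group's GID.
run_archivebox_add() {
    gid="$(gid_of_group "$SHARED_GROUP")"
    if [ -z "$gid" ]; then
        echo "group $SHARED_GROUP not found" >&2
        return 1
    fi
    docker compose run --rm -e PGID="$gid" archivebox add "$1"
}
```

After sourcing the file, you would run `setup_shared_group` once as root and then `run_archivebox_add https://example.com/`. To try the simpler suggestion first, the same `-e PGID=33` flag can be passed to `docker compose run` directly, without creating a group.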