Description
Describe the bug
As part of my ridiculously large archiving attempt (partly documented in #233), I have done a first batch of URL imports with the first 100 URLs found. For a reason I can't explain (maybe because I ran two archivebox add commands in parallel?), that eventually crashed with:
FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'
No problem, I thought - I can resume! So I did that with archivebox add --update-all, but that crashed as well, with:
TypeError: save_file_to_sources(..., path: str) got unexpected NoneType argument path=None
I suspect this is because --update-all actually expects a list of URLs to be passed, but the usage doesn't make that clear and we shouldn't be crashing there.
Steps to reproduce
- call archivebox add --update-all with no other URLs
Screenshots or log output
First, the original crash, not the subject of this bug report:
[...]
[+] [2019-05-06 21:39:14] "www.hjdskes.nl/projects/cage"
https://www.hjdskes.nl/projects/cage/
> ./archive/1557178364.10
> title
> favicon
> wget
Failed:
TimeoutExpired Command 'wget' timed out after 60 seconds
Run to see full output:
cd /srv/backup/archive/archivebox/archive/1557178364.10;
wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1557178755 --page-requisites "--user-agent=ArchiveBox/0.4.1 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://www.hjdskes.nl/projects/cage/
> pdf
> screenshot
> dom
> media
> archive_org
! Failed to archive link: FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'
Traceback (most recent call last):
File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>
sys.exit(main())
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main
pwd=pwd or OUTPUT_DIR,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main
out_dir=pwd or OUTPUT_DIR,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 521, in add
archive_link(link, out_dir=link.link_dir)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/extractors/__init__.py", line 85, in archive_link
patch_main_index(link)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/__init__.py", line 323, in patch_main_index
write_json_main_index(patched_links, out_dir=out_dir)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/json.py", line 77, in write_json_main_index
atomic_write(main_index_json, os.path.join(out_dir, JSON_INDEX_FILENAME))
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/system.py", line 79, in atomic_write
os.rename(tmp_file, path)
FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'
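My guess (unconfirmed) is that both parallel runs were writing to the same index.json.tmp name, so whichever process renamed it first left the other with nothing to rename. Here is a minimal sketch of one way to avoid that, assuming that diagnosis is right: give each writer its own temp file in the target directory. Note that atomic_write_unique is a hypothetical name for illustration, not ArchiveBox's actual atomic_write, whose body I haven't checked beyond the os.rename call shown in the traceback.

```python
import os
import tempfile


def atomic_write_unique(contents: str, path: str) -> None:
    """Write `contents` to `path` via a uniquely named temp file (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp picks a fresh name per call, so two concurrent writers
    # never rename (and thereby consume) each other's temp file.
    fd, tmp_path = tempfile.mkstemp(
        dir=directory,
        prefix=os.path.basename(path) + ".",
        suffix=".tmp",
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Don't leave a stray temp file behind if the write or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This would only stop the crash. With two processes updating the index at once, the last writer still wins, so changes from the other run could be lost unless there is some locking on top.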
Re-adding the list does nothing:
[1]anarcat@curie:archivebox(master)$ archivebox add wallabag-p1.list
> ./sources/wallabag-p1.list-1557179130.txt
[*] [2019-05-06 21:45:30] Parsing new links from output/sources/wallabag-p1.list-1557179130.txt...
> Parsed 100 links as Plain Text (0 new links added)
[*] [2019-05-06 21:45:30] Writing 101 links to main index...
√ /srv/backup/archive/archivebox/index.sqlite3
√ /srv/backup/archive/archivebox/index.json
√ /srv/backup/archive/archivebox/index.html
[▶] [2019-05-06 21:45:31] Updating content for 0 matching pages in archive...
[√] [2019-05-06 21:45:31] Update of 0 pages complete (0.00 sec)
- 0 links skipped
- 0 links updated
- 0 links had errors
To view your archive, open:
/srv/backup/archive/archivebox/index.html
Or run the built-in webserver:
archivebox server
[*] [2019-05-06 21:45:31] Writing 101 links to main index...
√ /srv/backup/archive/archivebox/index.sqlite3
√ /srv/backup/archive/archivebox/index.json
√ /srv/backup/archive/archivebox/index.html
Looking at -h, I noticed --update-all, so I tried that:
anarcat@curie:archivebox(master)$ archivebox add --update-all
Traceback (most recent call last):
File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>
sys.exit(main())
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main
pwd=pwd or OUTPUT_DIR,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main
out_dir=pwd or OUTPUT_DIR,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 496, in add
import_path = save_file_to_sources(import_path, out_dir=out_dir)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 98, in typechecked_function
check_argument_type(arg_key, arg_val)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 92, in check_argument_type
str(arg_val)[:64],
TypeError: save_file_to_sources(..., path: str) got unexpected NoneType argument path=None
The correct call is of course to retry with the same URLs:
anarcat@curie:archivebox(master)$ archivebox add --update-all wallabag-p1.list
which works, but it would actually be nice to (a) not crash when --update-all is passed without an argument (maybe just fail more politely during argument parsing) and (b) eventually just do the right thing, which is probably to retry any failed URL from the database.
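For (a), something along these lines is what I have in mind. This is only a sketch with a standalone argparse parser: parse_add_args, the import_path positional, and the help strings are made up for illustration and are not the actual code in archivebox_add.py.

```python
import argparse
from typing import List, Optional


def parse_add_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical parser: reject --update-all without an import path."""
    parser = argparse.ArgumentParser(prog="archivebox add")
    parser.add_argument(
        "--update-all",
        action="store_true",
        help="also update existing links (illustrative help text)",
    )
    parser.add_argument(
        "import_path",
        nargs="?",
        default=None,
        help="file or URL to import links from (illustrative help text)",
    )
    args = parser.parse_args(argv)
    if args.update_all and args.import_path is None:
        # parser.error prints the usage line plus this message to stderr
        # and exits with status 2, instead of surfacing a TypeError later.
        parser.error("--update-all requires a file or URL to import")
    return args
```

With that, archivebox add --update-all would end with a usage message rather than a traceback, while archivebox add --update-all wallabag-p1.list would parse as before.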
Software versions
- OS: Debian buster 10 up to date
- ArchiveBox version: 0.4.1 installed from pip
- Python version: 3.7.3something
- Chrome version: irrelevant?
Thanks for your hard work, and sorry for the flood of bug reports! :)