Releases: commoncrawl/cc-downloader
v0.6.1
734baa8
This new pre-release adds more documentation with details about the installation process. It also corrects some existing typos.
v0.6.0
5fb6ff4
This release adds support for CC-NEWS, as well as validation of the crawl reference that the user supplies when using the download-paths sub-command.
The release also updates multiple dependencies and bumps the Rust edition and compiler version to 2024 and 1.85, respectively.
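As an illustration of what validating a crawl reference can involve, here is a minimal sketch. It is not cc-downloader's actual validation code: the function name is hypothetical, and the accepted shape (a CC-MAIN- or CC-NEWS- prefix followed by a four-digit year, a dash, and two digits, as in CC-MAIN-2024-46) is an assumption.

```rust
/// Hypothetical check, not cc-downloader's real validation logic.
/// Assumes references look like `CC-MAIN-YYYY-NN` or `CC-NEWS-YYYY-NN`.
fn looks_like_crawl_reference(reference: &str) -> bool {
    // Accept either of the two assumed prefixes and keep what follows it.
    let rest = match reference
        .strip_prefix("CC-MAIN-")
        .or_else(|| reference.strip_prefix("CC-NEWS-"))
    {
        Some(rest) => rest,
        None => return false,
    };
    // Expect exactly "YYYY-NN": four digits, a dash, two digits.
    let bytes = rest.as_bytes();
    bytes.len() == 7
        && bytes[..4].iter().all(u8::is_ascii_digit)
        && bytes[4] == b'-'
        && bytes[5..].iter().all(u8::is_ascii_digit)
}
```

A check like this only rejects malformed input early; it cannot tell whether a well-formed reference names a crawl that actually exists.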
v0.5.2
e2cd0e0
Changes
In this pre-release we:
- Fixed issue #6 by adding a new User Agent
- Introduced refactors so that all linter checks pass
- Introduced a Rust workflow to ensure that the code compiles and the tests pass in the dev and main branches
- Changed the contributing policy so that PRs are merged to the dev branch
- Made slight updates to the documentation
Breaking Changes
There are no breaking changes for this release.
Notes
This pre-release starts organizing the download.rs file so that cc-downloader can also be used as a library and so that bindings can be more easily written.
v0.5.1
Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via HTTPS. cc-downloader is intended to be a user-friendly and polite downloader. It was made in response to the significant increase in downloads of our data in recent months. This rise in interest in our dataset is exciting to see, but it also makes it harder for some users to download our data successfully, due to the quirks of downloading from a high-traffic storage bucket.
cc-downloader is our solution to this problem, enabling our users to continue downloading our data via HTTPS without issues. We have designed cc-downloader with a polite retry mechanism that makes sure every requested file is eventually downloaded. It also implements jitter and exponential backoff strategies in order to avoid overwhelming our infrastructure.
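To make the idea of exponential backoff with jitter concrete, here is a minimal sketch. It is not taken from cc-downloader's source: the function name, the "full jitter" variant, and the base and cap values in the usage note are all assumptions chosen for illustration. The random fraction is passed in as a parameter because the Rust standard library has no random number generator; a real client would draw it from one.

```rust
use std::time::Duration;

/// Hypothetical helper: how long to wait before retry number `attempt` (0-based).
/// `jitter` is a fraction in [0.0, 1.0] that the caller draws at random.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration, jitter: f64) -> Duration {
    // Exponential part: base * 2^attempt, saturating on overflow, capped at `cap`.
    let factor = 2u32.saturating_pow(attempt);
    let ceiling = base.saturating_mul(factor).min(cap);
    // "Full jitter": wait a random share of the ceiling, so that many clients
    // that failed at the same moment do not all retry at the same moment.
    ceiling.mul_f64(jitter.clamp(0.0, 1.0))
}
```

With a base of 500 ms and a cap of 30 s, the ceiling doubles on each attempt (500 ms, 1 s, 2 s, ...) until it reaches the cap, and the actual wait falls somewhere between zero and that ceiling.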
If you wish to install cc-downloader, we have released pre-compiled binaries for all major operating systems and architectures. cc-downloader is written in Rust and is distributed as a “crate”, so if you have cargo installed, you can also install cc-downloader with this command:
cargo install cc-downloader
Once you have installed it, you’ll see that cc-downloader has two sub-commands:
First, download-paths downloads the file paths list for a given crawl and subset from our bucket to a given destination folder in your file system:
cc-downloader download-paths CC-MAIN-2024-46 wet path/to/folder
In this case, the paths file will be path/to/folder/wet.paths.gz.
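The location of the paths file follows from the two arguments in the example above. A small sketch of that relationship, assuming the file is always named after the subset with a .paths.gz suffix (the function name is hypothetical, and only the wet case is confirmed by the text):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: where the paths list ends up for a given subset.
/// Assumes the naming pattern `<subset>.paths.gz` seen in the `wet` example.
fn paths_file_location(destination: &Path, subset: &str) -> PathBuf {
    destination.join(format!("{subset}.paths.gz"))
}
```

For the destination path/to/folder and the subset wet, this yields path/to/folder/wet.paths.gz, which is the file the download sub-command then reads.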
Next, download reads this file paths list and concurrently downloads the files to a given destination folder in your file system:
cc-downloader download path/to/folder/wet.paths.gz path/to/folder
By default, this preserves the tree structure that we use internally.
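Preserving the tree structure amounts to appending each remote path to the destination folder instead of keeping only the file name. The sketch below shows both behaviours side by side. It is an assumption about how such a mapping could work, not cc-downloader's code: the function name, the keep_tree switch, and the sample remote path in the usage note are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical mapping from an entry in the paths list to a local file.
/// Assumes `remote_path` is relative (no leading slash).
fn local_target(destination: &Path, remote_path: &str, keep_tree: bool) -> Option<PathBuf> {
    let remote = Path::new(remote_path);
    if keep_tree {
        // Keep every directory component of the remote path.
        Some(destination.join(remote))
    } else {
        // Flatten: keep only the final file name, if there is one.
        remote.file_name().map(|name| destination.join(name))
    }
}
```

Given an illustrative entry such as crawl-data/CC-MAIN-2024-46/segments/0000/wet/example.warc.wet.gz, the tree-preserving mode writes to path/to/folder/crawl-data/CC-MAIN-2024-46/segments/0000/wet/example.warc.wet.gz, while the flattened mode writes to path/to/folder/example.warc.wet.gz.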
cc-downloader is still under active development, so if you find any issues or would like to submit a feature request, please visit our GitHub repository: https://github.com/commoncrawl/cc-downloader/. Contributions are always welcome! We hope that this tool will make it easier for our users to download and use our data.
If you encounter problems with cc-downloader that look like they are caused by high traffic, you can check our current traffic levels on our infrastructure status webpage.