Request modification of existing behavior or design
What is the problem that your feature request solves?
When archiving a lot of pages, some files end up identical across many of those snapshots. The problem is that these duplicates take up more and more space even though their content is exactly the same.
There are solutions on the filesystem side (ZFS deduplication, for example), but on the application side it is more complex.
I'm thinking of using rdfind, coupled with a script, to turn duplicate files into hardlinks. That way, deleting the original page doesn't lose the files shared with other snapshots. But I'm afraid my tricks will confuse ArchiveBox in the future ^^
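I haven't tested this, but a minimal sketch of what such a maintenance script could look like, assuming rdfind is installed and the archive lives under a hypothetical `/data/archive` path (the dry-run flag is left on so it only reports what it would change):

```python
import subprocess

# Hypothetical ArchiveBox data directory; adjust to the real output path.
ARCHIVE_DIR = "/data/archive"

# Ask rdfind to find duplicate files and replace them with hardlinks.
# "-dryrun true" only prints what would be done; set it to "false"
# once the report looks sane.
subprocess.run(
    ["rdfind", "-dryrun", "true", "-makehardlinks", "true", ARCHIVE_DIR],
    check=True,
)
```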
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
I think links (hard or symlinks) could be used: store each unique file once in a "global folder" and have every archive link to those files. Duplicate files share the same MD5 hash, and each hash could be stored in the DB so duplicates can be found quickly without a lot of extra I/O.
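To illustrate the idea (this is not existing ArchiveBox behavior; the pool path, function names, and use of MD5 are just assumptions for the sketch): hash each file, keep one copy per hash in a shared pool, and replace duplicates with hardlinks to the pooled copy.

```python
import hashlib
import os
from pathlib import Path

# Hypothetical "global folder" holding one copy of each unique file, keyed by hash.
POOL = Path("/data/archive/.dedup_pool")

def file_hash(path: Path) -> str:
    """Hash the file in chunks so large snapshots don't need to fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(path: Path) -> None:
    """Keep one pooled copy per hash and replace duplicates with hardlinks to it."""
    POOL.mkdir(parents=True, exist_ok=True)
    pooled = POOL / file_hash(path)
    if not pooled.exists():
        os.link(path, pooled)   # first time this content is seen: add it to the pool
    elif not path.samefile(pooled):
        path.unlink()           # duplicate content: swap the file for a hardlink
        os.link(pooled, path)
```

Because hardlinks point at the same inode, deleting the copy inside one snapshot doesn't break the others; the content is only freed from disk once the last link is gone. The hash-to-path mapping could also live in the index DB instead of the pool's file names.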
What hacks or alternative solutions have you tried to solve the problem?
Not tried yet, but I think rdfind could find the duplicates, or each file could be hashed.
How badly do you want this new feature?
It's an urgent deal-breaker, I can't live without it
It's important to add it in the near-mid term future
It would be nice to have eventually
(Yes, both: it's a nice-to-have, but my disk space says it's important ^^)
I'm willing to contribute dev time / money to fix this issue
I like ArchiveBox so far / would recommend it to a friend (and write an article ^^)
I've had a lot of difficulty getting ArchiveBox set up