Looking for the right archive format with deduplication

I often copy my files to an external disk as a backup. So far, nothing fancy.

However, over time I end up with many copies of more or less the same folders, and that wastes a lot of space.

Incremental backups, which only add new files to an archive, might look like a solution, but over time the paths change and the folder structure changes, so path-based deduplication would break.
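To make the point concrete, here is a minimal sketch of what I mean by deduplication that survives renames: files are grouped by a hash of their content, so the path plays no role. The function names (`file_digest`, `group_by_content`, `duplicates`) are mine, and this only handles whole-file duplicates, not shared chunks inside files.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_content(root):
    """Map each content digest to the list of paths under root that share it."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[file_digest(path)].append(path)
    return groups


def duplicates(root):
    """Return only the digests that appear under more than one path."""
    return {d: sorted(p) for d, p in group_by_content(root).items() if len(p) > 1}
```

Calling `duplicates("/path/to/backups")` (a hypothetical path) would list every set of identical files, wherever they were moved or renamed between backups.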

One solution could be to format my disk with a file system such as ZFS or BTRFS, which support deduplication. However, I then can’t use the disk for quick operations/archiving on Windows, nor on Mac (not easily, at least).

So I want an archive format with:

  • Deduplication
  • Some compression (lots of source code files)

One might think, and I often came across this statement while searching the web for a solution, that there is no need to worry because compression algorithms inherently deduplicate. That is actually wrong: they only look for repeats within a limited window, so they are not made to merge very large patterns that sit far apart. The following experiment shows it with files of random (incompressible) data:

-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:09 binA
-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:10 binACOPY
-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:09 binB
-rwxrwxrwx 1 tbarbette tbarbette 256M Apr 17 21:19 binBIG
-rwxrwxrwx 1 tbarbette tbarbette 256M Apr 17 21:19 binBIGCOPY
-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:09 binC

Here are the archive size and the time taken for a few formats:

Format                    Size   Time    Command
tar.gz                    553M   19.48s  tar -cpazf [archive] [file]
tar.xz                    553M   303s    tar -cpaJf [archive] [file]
zpaq                      287M   18.7s   zpaq add [archive] [file]
wim                       287M   4.4s    7zz a [archive].wim [file]
.7z                       543M   37s     7zz a [archive].7z [file]
.7z (512M dictionary)     287M   140s    7zz a [archive].7z -m0=LZMA2:d512m [file]
.7z (1513M dictionary)    287M   114s    7zz a [archive].7z -m0=LZMA2:d1513m [file]
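As a sanity check on those numbers, here is a small sketch of the arithmetic. It assumes, from the names only, that binACOPY and binBIGCOPY are byte-for-byte copies of binA and binBIG, and that the files are stored in the order of the listing above; both are assumptions, not something the archivers report.

```python
# (name, size in MiB from the listing, content label).
# Two entries with the same label are assumed to be identical files.
files = [
    ("binA", 10, "A"),
    ("binACOPY", 10, "A"),
    ("binB", 10, "B"),
    ("binBIG", 256, "BIG"),
    ("binBIGCOPY", 256, "BIG"),
    ("binC", 10, "C"),
]

total = sum(size for _name, size, _content in files)

first_offset = {}  # content label -> offset of its first occurrence
distances = {}     # copy name -> how far back its original starts
unique = 0         # size if every distinct content is stored once
offset = 0
for name, size, content in files:
    if content in first_offset:
        distances[name] = offset - first_offset[content]
    else:
        first_offset[content] = offset
        unique += size
    offset += size

print(total)      # 552
print(unique)     # 286
print(distances)  # {'binACOPY': 10, 'binBIGCOPY': 256}
```

552 and 286 MiB line up with the 553M and 287M results once some archive overhead is added. Under the same assumptions, the 543M of the default .7z run is what one would expect if only the copy sitting 10M back was within reach of the dictionary, while the one 256M back was not.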

zpaq is the clear winner, considering it deduplicates and can also compress at the same time. However, it is not very widespread and has no good GUI available, and I’m wondering about recovery in case of problems.

wim (at least as written by 7zz) has no compression, so it would need to be wrapped in something else. The problem is then adding files: I would have to decompress the outer archive to get the inner wim back first. Since the idea is to back up the same computer again and again, and one of those archives is 200G, adding a single file would take a very long time, whereas zpaq can add one quite fast.

7-Zip with a dictionary large enough to span big duplicates seems to be a good compromise.
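The dictionary effect can be reproduced at small scale. The sketch below is not the 7zz run above: it uses Python's standard `zlib` and `lzma` modules, a 1 MiB random block instead of 256M, and dictionary sizes (64 KiB and 4 MiB) that I picked arbitrarily to sit below and above the distance between the two copies.

```python
import lzma
import os
import zlib

BLOCK = 1 << 20  # 1 MiB of random (incompressible) data


def xz_size(data, dict_size):
    """Compressed size of data using LZMA2 with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


block = os.urandom(BLOCK)
data = block + block  # the same block twice, like a file followed by its copy

sizes = {
    "deflate (32 KiB window)": len(zlib.compress(data, 9)),
    "lzma2, 64 KiB dictionary": xz_size(data, 64 * 1024),
    "lzma2, 4 MiB dictionary": xz_size(data, 4 * 1024 * 1024),
}
for name, size in sizes.items():
    print(f"{name}: {size / BLOCK:.2f} x block size")
```

Only the last variant should come out close to one block instead of two, because its dictionary reaches back far enough to see the first copy. The same reasoning suggests the limit of this approach: the dictionary has to cover the distance between a duplicate and its original, which is easy for 256M and much less so inside a 200G archive.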

Any input?