User:StefanoZacchiroli/Content deduplication

From Software Heritage Wiki

Some experiments on deduplicating contents at sub-file granularity.

Datasets

Linux kernel, Git repo

  • origin: git.kernel.org, on 2018-01-06
  • 1,653,941 content blobs, for a total of 19 GB (compressed)

Rabin fingerprints

Test 1: Linux kernel Git repo

Rabin fingerprint parameters:

  • prime: 3
  • window_size: 48 KB
  • min_block_size: 2 KB
  • avg_block_size: 8 KB
  • max_block_size: 64 KB
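
To illustrate how these parameters interact, here is a minimal sketch of content-defined chunking. It is not the tool used for this experiment: it uses a simple polynomial rolling hash modulo 2^32 as a stand-in for a true Rabin fingerprint, and the function name chunk_boundaries, the modulus, and the "low bits all set" cut rule are assumptions made for illustration. The default values mirror the list above as written.

```python
def chunk_boundaries(data: bytes, prime: int = 3,
                     window_size: int = 48 * 1024,
                     min_block: int = 2 * 1024,
                     avg_block: int = 8 * 1024,
                     max_block: int = 64 * 1024) -> list[int]:
    """Return the end offsets of content-defined chunks of `data`.

    Illustrative only: polynomial rolling hash, not a real Rabin
    fingerprint. `avg_block` is assumed to be a power of two.
    """
    modulus = 1 << 32              # assumption: 32-bit hash space
    mask = avg_block - 1           # cut when the low bits are all ones
    # Weight of the oldest byte in a full window, used to remove it.
    top = pow(prime, window_size - 1, modulus)

    boundaries = []
    start = 0                      # offset where the current chunk begins
    h = 0                          # rolling hash, reset at every cut
    for i, byte in enumerate(data):
        pos = i - start
        if pos >= window_size:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - window_size] * top) % modulus
        h = (h * prime + byte) % modulus

        length = pos + 1
        at_hash_cut = length >= min_block and (h & mask) == mask
        if at_hash_cut or length >= max_block:
            boundaries.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        boundaries.append(len(data))   # trailing chunk, may be short
    return boundaries
```

Because cut points depend on the bytes inside the sliding window rather than on absolute offsets, inserting a few bytes into a file tends to shift only the nearby boundaries, which is what lets later chunks still match between versions. min_block_size and max_block_size bound each chunk (except possibly the last), while avg_block_size sets how often the hash condition is expected to fire.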

Results:

  • original size (uncompressed): 55.89 GB
  • deduplicated chunk size (uncompressed): 19.87 GB (35.55% of the original size)
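
As a rough sketch of how such figures can be obtained, the snippet below sums the size of all chunks and the size of distinct chunks only, identifying each chunk by a cryptographic digest. The function name dedup_sizes and the choice of SHA-256 are assumptions for illustration; the page does not say how chunks were identified in the experiment. The last lines re-derive the percentage from the two sizes reported above.

```python
import hashlib
from typing import Iterable


def dedup_sizes(chunks: Iterable[bytes]) -> tuple[int, int]:
    """Return (total size, size of distinct chunks) in bytes."""
    seen = set()       # digests of chunks already counted
    original = 0
    unique = 0
    for chunk in chunks:
        original += len(chunk)
        digest = hashlib.sha256(chunk).digest()  # assumed chunk identifier
        if digest not in seen:
            seen.add(digest)
            unique += len(chunk)
    return original, unique


# Toy input: the first and last chunks are identical, so only 6 of
# the 10 bytes need to be stored.
original, unique = dedup_sizes([b"aaaa", b"bb", b"aaaa"])
print(original, unique)  # 10 6

# Ratio implied by the sizes reported above (values in GB).
ratio = 19.87 / 55.89
print(f"{ratio:.2%}")  # 35.55%
```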
