User:StefanoZacchiroli/Content deduplication

From Software Heritage Wiki


Revision as of 13:02, 7 January 2018 by StefanoZacchiroli


Some experiments on deduplicating contents at sub-file granularity.

Datasets

Linux kernel, Git repo

origin: git.kernel.org, on 2018-01-06
1,653,941 content blobs, for a total of 19 GB (compressed)

Rabin fingerprints

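Approach: use Rabin fingerprints to split each content blob into variable-size blocks at content-defined boundaries, then deduplicate identical blocks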
Implementation: swh-dedup-blocks.py

Test 1: Linux kernel Git repo

Rabin fingerprint parameters:

prime: 3
window_size: 48 KB
min_block_size: 2 KB
avg_block_size: 8 KB
max_block_size: 64 KB
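
To illustrate how these parameters interact, here is a minimal sketch of content-defined chunking. It is not the code of swh-dedup-blocks.py: it assumes a plain polynomial rolling hash in place of a true Rabin fingerprint, and the names `chunks`, `MODULUS` and the boundary test against `avg_block - 1` are hypothetical choices made for this example. The parameter values are copied from the list above.

```python
from typing import Iterator

# Values taken from the parameter list above.
PRIME = 3
WINDOW_SIZE = 48 * 1024
MIN_BLOCK = 2 * 1024
AVG_BLOCK = 8 * 1024
MAX_BLOCK = 64 * 1024

# Assumption for this sketch: hash arithmetic is done modulo a Mersenne prime.
MODULUS = (1 << 61) - 1


def chunks(data: bytes,
           prime: int = PRIME,
           window_size: int = WINDOW_SIZE,
           min_block: int = MIN_BLOCK,
           avg_block: int = AVG_BLOCK,
           max_block: int = MAX_BLOCK) -> Iterator[bytes]:
    """Yield variable-size blocks whose boundaries depend on the content."""
    # avg_block is a power of two, so this test fires for roughly one
    # hash value in avg_block.
    mask = avg_block - 1
    # Weight of the oldest byte in a full window, used to remove it.
    out_factor = pow(prime, window_size - 1, MODULUS)
    start = 0  # offset where the current block begins
    fp = 0     # rolling hash of the last window_size bytes of the block
    for i, byte in enumerate(data):
        pos = i - start
        if pos >= window_size:
            # The window is full: drop the byte that slides out of it.
            fp = (fp - data[i - window_size] * out_factor) % MODULUS
        fp = (fp * prime + byte) % MODULUS
        size = pos + 1
        # Cut on a hash match once the block is large enough,
        # or unconditionally at the maximum block size.
        if (size >= min_block and (fp & mask) == mask) or size >= max_block:
            yield data[start:i + 1]
            start = i + 1
            fp = 0
    if start < len(data):
        yield data[start:]  # trailing block, possibly shorter than min_block
```

Because a boundary depends only on nearby bytes, an insertion or deletion in a file shifts block boundaries locally instead of changing every following block, which is what allows identical regions of different blobs to map to identical blocks.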

Results:

original size (uncompressed): 55.89 GB
size of deduplicated chunks (uncompressed): 19.87 GB (35.55% of the original size)
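
The percentage is the ratio of unique chunk bytes to total bytes. The sketch below shows one way such figures could be computed; it assumes chunks are identified by their SHA-1 digest, which is an assumption of this example and not something stated on this page, and `dedup_stats` is a hypothetical helper name.

```python
import hashlib
from typing import Iterable, Tuple


def dedup_stats(chunks: Iterable[bytes]) -> Tuple[int, int]:
    """Return (original_size, deduplicated_size) in bytes for a chunk stream."""
    seen = set()  # digests of chunks already counted
    original = 0
    deduplicated = 0
    for chunk in chunks:
        original += len(chunk)
        digest = hashlib.sha1(chunk).digest()
        if digest not in seen:
            # First occurrence of this chunk: it must be stored.
            seen.add(digest)
            deduplicated += len(chunk)
    return original, deduplicated


# Toy input: the second b"aaaa" is a duplicate and is stored only once.
original, deduplicated = dedup_stats([b"aaaa", b"bbbb", b"aaaa", b"cc"])
print(f"{deduplicated}/{original} bytes = {100 * deduplicated / original:.2f}%")

# The same ratio applied to the sizes reported above.
print(f"{100 * 19.87 / 55.89:.2f}%")
```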


Retrieved from "https://wiki.softwareheritage.org/index.php?title=User:StefanoZacchiroli/Content_deduplication&oldid=755"