User:StefanoZacchiroli/Content deduplication
Revision as of 10:20, 8 January 2018 by StefanoZacchiroli
Some experiments on deduplicating contents at sub-file granularity.
Datasets
Linux kernel, Git repo
- origin: git.kernel.org, on 2018-01-06
- 1,653,941 content blobs, for a total of 19 GB (compressed)
- original size (uncompressed): 55.89 GB
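Figures of this kind can be re-measured on a local clone. The sketch below is one possible way to do it, not necessarily how the numbers above were obtained: it assumes a `git` binary on the PATH and relies on `git cat-file --batch-all-objects --batch-check`, which prints one `<sha> <type> <size>` line per object. The function names `blob_stats` and `repo_blob_stats` are made up for this example.

```python
import subprocess


def blob_stats(lines):
    """Count blobs and sum their uncompressed sizes.

    `lines` is an iterable of "<sha> <type> <size>" strings, i.e. the
    default output format of `git cat-file --batch-check`.
    """
    count = total = 0
    for line in lines:
        _sha, objtype, size = line.split()
        if objtype == "blob":  # skip commits, trees and tags
            count += 1
            total += int(size)
    return count, total


def repo_blob_stats(repo_path):
    """Run git on a local clone (hypothetical path) and summarize its blobs."""
    proc = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-all-objects", "--batch-check"],
        check=True, capture_output=True, text=True,
    )
    return blob_stats(proc.stdout.splitlines())
```

Calling `repo_blob_stats("/path/to/linux.git")` returns the blob count and the total uncompressed size in bytes. The compressed size depends on how the objects are stored and is not covered by this sketch.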
Rabin fingerprints
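- Approach: split each content blob into variable-size chunks, placing chunk boundaries with Rabin fingerprints computed over a sliding window, then deduplicate identical chunks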
- Implementation: swh-dedup-blocks.py
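To illustrate the approach, here is a minimal sketch of content-defined chunking followed by chunk-level deduplication. It is not the code of swh-dedup-blocks.py: it swaps in a simple polynomial rolling hash (Rabin-Karp style, modulo 2^32) for a true Rabin fingerprint, and the names `chunk`, `dedup`, `PRIME`, `WINDOW` and `MOD` are assumptions made for this example. The default chunk sizes mirror the min/avg/max settings of the first test; the 48-byte window is an illustrative value chosen to keep the sketch fast.

```python
import hashlib

PRIME = 3        # multiplier of the rolling hash (assumption for this sketch)
WINDOW = 48      # sliding window length in bytes (illustrative value)
MOD = 1 << 32    # keep the hash within 32 bits


def chunk(data, min_size=2048, avg_size=8192, max_size=65536):
    """Yield content-defined chunks of `data` (a bytes object).

    A boundary is placed where the low bits of the rolling hash are all
    ones, which happens on average every `avg_size` bytes for random
    input; `avg_size` must therefore be a power of two.
    """
    mask = avg_size - 1
    top = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start, h = 0, 0
    for i, byte in enumerate(data):
        pos = i - start                # offset inside the current chunk
        if pos >= WINDOW:              # window full: drop its oldest byte
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * PRIME + byte) % MOD
        size = pos + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0        # restart the hash for the next chunk
    if start < len(data):
        yield data[start:]             # trailing chunk, may be below min_size


def dedup(blobs, **params):
    """Return (total bytes seen, bytes left after chunk deduplication)."""
    store = {}                         # SHA1 of chunk -> chunk length
    total = 0
    for blob in blobs:
        total += len(blob)
        for c in chunk(blob, **params):
            store.setdefault(hashlib.sha1(c).digest(), len(c))
    return total, sum(store.values())
```

Because boundaries depend only on the bytes inside the window, inserting or deleting a few bytes in a file tends to change only the chunks around the edit, so the remaining chunks still deduplicate against earlier versions of the same file.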
Test 1
Dataset: linux.git
Rabin fingerprint parameters:
- prime: 3
- window_size: 48 KB
- chunk size (min/avg/max): 2 KB / 8 KB / 64 KB
Results:
- average chunk size (effective): 9.37 KB
- total size of deduplicated chunks (uncompressed): 19.87 GB (35.55% of the original size)
Test 2
Dataset: linux.git
Rabin fingerprint parameters:
- prime: 3
- window_size: 48 KB
- chunk size (min/avg/max): 512 B / 2 KB / 8 KB
Results:
- average chunk size (effective): 5.07 KB
- total size of deduplicated chunks (uncompressed): 16.19 GB (28.96% of the original size)
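The percentages in the results can be cross-checked against the original uncompressed size of the dataset (55.89 GB). The short sketch below recomputes them from the rounded figures reported on this page, so the last digit may differ slightly from the values above; the names `ORIGINAL_GB`, `RESULTS` and `remaining_percent` are only used for this example.

```python
ORIGINAL_GB = 55.89  # uncompressed size of the linux.git blobs

# Deduplicated size in GB for each test, as reported above.
RESULTS = {"test 1": 19.87, "test 2": 16.19}


def remaining_percent(dedup_gb, original_gb=ORIGINAL_GB):
    """Size left after deduplication, as a percentage of the original."""
    return 100.0 * dedup_gb / original_gb


for name, dedup_gb in RESULTS.items():
    pct = remaining_percent(dedup_gb)
    print(f"{name}: {pct:.2f}% remaining, {100 - pct:.2f}% saved")
```

In these two runs the smaller chunk size settings of test 2 leave 3.68 GB less data than test 1. This comparison does not account for the cost of indexing a larger number of chunks.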