Difference between revisions of "User:StefanoZacchiroli/Content deduplication"
m (StefanoZacchiroli moved page User:StefanoZacchiroli/Scratch/Content deduplication using Rabin fingerprints to User:StefanoZacchiroli/Scratch/Content deduplication)
References
- LBFS: https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf
- Hailstorm: http://www2009.eprints.org/7/1/p61.pdf
- DCT: https://pdfs.semanticscholar.org/5231/82a9a66f96241d4bab0b92cc92e9cbda22c7.pdf
- ScanCode tokenizer: https://github.com/nexB/scancode-toolkit/blob/ba5be09e54f57bd519239fd81746ebc4d0f9af9c/src/licensedcode/tokenize.py#L188
Revision as of 12:53, 7 January 2018
Some experiments on deduplicating contents at sub-file granularity.
Linux kernel, Git repo
- origin: git.kernel.org, on 2018-01-06
- 1,653,941 content blobs, for a total of 19 GB (compressed)
Rabin fingerprints
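Split each blob into variable-size chunks at content-defined boundaries chosen with Rabin fingerprints (https://en.wikipedia.org/wiki/Rabin_fingerprint), so that identical chunks are stored only once.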
test 1
Rabin fingerprint parameters:
- prime: 3
- window_size: 48 KB
- min_block_size: 2 KB
- avg_block_size: 8 KB
- max_block_size: 64 KB
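How parameters like these typically drive chunking can be sketched as follows. This is a minimal illustration, not the code used for the experiment: it swaps a true Rabin fingerprint (polynomial arithmetic over GF(2)) for a simple polynomial rolling hash modulo a Mersenne prime, and the function names, the `modulus` value, the boundary test, and the use of SHA-1 to identify chunks are all assumptions.

```python
import hashlib


def chunk_boundaries(data, prime=3, window_size=48 * 1024,
                     min_size=2 * 1024, avg_size=8 * 1024,
                     max_size=64 * 1024, modulus=(1 << 61) - 1):
    """Yield (start, end) offsets of content-defined chunks of `data`.

    Assumption: a polynomial rolling hash stands in for a real Rabin
    fingerprint; the defaults mirror the parameter list above.
    """
    # Weight carried by the byte that is about to leave the window.
    out_factor = pow(prime, window_size, modulus)
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Slide the window: add the incoming byte ...
        h = (h * prime + byte) % modulus
        # ... and drop the byte that fell out of the window, if any.
        if i >= window_size:
            h = (h - data[i - window_size] * out_factor) % modulus
        size = i + 1 - start
        # Cut when the hash hits the target pattern (about once every
        # avg_size bytes), but never below min_size or above max_size.
        at_pattern = size >= min_size and h % avg_size == avg_size - 1
        if at_pattern or size >= max_size:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)  # trailing chunk, possibly below min_size


def dedup_stats(blobs, **params):
    """Return (total_bytes, unique_chunk_bytes) over an iterable of blobs."""
    seen = {}  # chunk digest -> chunk length
    total = 0
    for blob in blobs:
        total += len(blob)
        for s, e in chunk_boundaries(blob, **params):
            chunk = blob[s:e]
            seen.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
    return total, sum(seen.values())
```

Because the hash depends only on the last `window_size` bytes, an insertion or deletion early in a file shifts the following boundaries along with the content, so the chunks after the edit can still match chunks that were seen before.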
Results:
- original size (uncompressed): 55.89 GB
- total size of unique chunks after deduplication (uncompressed): 19.87 GB (35.55% of the original size)
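The percentage above is the deduplicated size divided by the original size. A small sketch of that arithmetic, using only the two sizes reported in the results (the helper names are hypothetical):

```python
def dedup_ratio(original_gb, dedup_gb):
    """Fraction of the original size that remains after deduplication."""
    return dedup_gb / original_gb


def space_saved_gb(original_gb, dedup_gb):
    """Absolute amount of data eliminated by deduplication, in GB."""
    return original_gb - dedup_gb


original_gb = 55.89  # original size (uncompressed), from the results
dedup_gb = 19.87     # unique chunks (uncompressed), from the results

print(f"remaining: {dedup_ratio(original_gb, dedup_gb):.2%}")
print(f"saved: {space_saved_gb(original_gb, dedup_gb):.2f} GB")
```

Both sizes are uncompressed, so the ratio is not directly comparable with the 19 GB compressed size of the repository.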