User:StefanoZacchiroli/Content deduplication: Difference between revisions

Revision as of 12:53, 7 January 2018

@@ Line 1: / Line 1: @@
-Some experiments on deduplicating contents at sub-file granularity using [https://en.wikipedia.org/wiki/Rabin_fingerprint Rabin fingerprints].
+Some experiments on deduplicating contents at sub-file granularity.
 == Linux kernel, Git repo ==
@@ Line 6: / Line 6: @@
 * 1,653,941 content blobs, for a total of 19 GB (compressed)
-== test 1 ==
+== Rabin fingerprints ==
+Split contents into chunks using [https://en.wikipedia.org/wiki/Rabin_fingerprint Rabin fingerprints].
+=== test 1 ===
 Rabin fingerprint parameters:
@@ Line 18: / Line 22: @@
 * original size (uncompressed): 55.89 GB
 * dedup chunk size (uncompressed): 19.87 GB (35.55%)
+== References ==
+* [https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf LBFS]
+* [http://www2009.eprints.org/7/1/p61.pdf Hailstorm]
+* [https://pdfs.semanticscholar.org/5231/82a9a66f96241d4bab0b92cc92e9cbda22c7.pdf DCT]
+* [https://github.com/nexB/scancode-toolkit/blob/ba5be09e54f57bd519239fd81746ebc4d0f9af9c/src/licensedcode/tokenize.py#L188 ScanCode tokenizer]
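Sub-file deduplication of the kind measured above can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiment: it swaps in a simple polynomial rolling hash (Rabin-Karp style) for a true Rabin fingerprint, and every parameter (`WINDOW`, `MIN_SIZE`, `AVG_MASK`, `MAX_SIZE`) as well as the helper names `chunks` and `dedup_stats` are assumptions, since the parameters of test 1 are not shown in this diff. Chunk boundaries are placed wherever the hash of the last `WINDOW` bytes has its low bits equal to zero, so identical content tends to produce identical chunks regardless of its offset in the file.

```python
import hashlib

# Hypothetical parameters; the values used in test 1 are not shown in this diff.
WINDOW = 48                 # bytes covered by the rolling hash
MIN_SIZE = 512              # never cut a chunk shorter than this
MAX_SIZE = 8192             # always cut when a chunk reaches this size
AVG_MASK = (1 << 11) - 1    # cut when the low 11 bits of the hash are zero
BASE = 257
MOD = (1 << 61) - 1
POW_OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunks(data: bytes):
    """Yield content-defined chunks of data (stand-in for Rabin chunking)."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        size = i - start + 1
        if size > WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        h = (h * BASE + byte) % MOD
        if (size >= MIN_SIZE and (h & AVG_MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing chunk, may be shorter than MIN_SIZE


def dedup_stats(blobs):
    """Return (original_size, deduplicated_size) in bytes for an iterable of blobs."""
    seen = {}   # chunk digest -> chunk length, each distinct chunk counted once
    total = 0
    for blob in blobs:
        total += len(blob)
        for c in chunks(blob):
            seen.setdefault(hashlib.sha1(c).digest(), len(c))
    return total, sum(seen.values())


if __name__ == "__main__":
    # Two toy "blobs" sharing most of their content.
    body = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(2000))
    total, unique = dedup_stats([body, b"header\n" + body])
    print(f"original: {total} B, deduplicated: {unique} B ({100 * unique / total:.2f}%)")
```

The percentage printed by the script is computed the same way as the figure reported above, deduplicated size divided by original size (19.87 GB / 55.89 GB ≈ 35.55%).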