User:StefanoZacchiroli/Content deduplication


Some experiments on deduplicating contents at sub-file granularity.

Datasets

Linux kernel, Git repo

origin: git.kernel.org, on 2018-01-06
1,653,941 content blobs, for a total of 19 GB (compressed)
original size (uncompressed): 55.89 GB

Random sample 0.1%

random sample of Software Heritage contents
0.1%, Bernoulli sampling; 4,111,582 contents
taken on 2018-01-09
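
Bernoulli sampling keeps each content independently with a fixed probability, so the sample size is itself random and only approximately 0.1% of the population. The page does not say how the sample was actually drawn; the sketch below is only an illustration, and the function name, the seed, and the integer identifiers are hypothetical.

```python
import random

def bernoulli_sample(items, p, seed=None):
    """Yield each item independently with probability p."""
    rng = random.Random(seed)
    for item in items:
        # random() is uniform in [0, 1), so this test succeeds with probability p
        if rng.random() < p:
            yield item

# Hypothetical stand-in for a stream of content identifiers.
ids = range(1_000_000)
sample = list(bernoulli_sample(ids, 0.001, seed=42))
print(len(sample))  # close to 1000, but not exactly 1000
```

Unlike fixed-size sampling, this needs a single pass and no knowledge of the population size in advance.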

Rabin fingerprint chunking

Approach: use Rabin fingerprints as in LBFS
Implementation: swh-dedup-blocks.py
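
The idea behind LBFS-style chunking can be sketched as follows: slide a fixed-size window over the content, hash the window, and cut a chunk wherever the low bits of the hash match a fixed pattern, subject to minimum and maximum chunk sizes. This is not the code of swh-dedup-blocks.py, which is not shown here. As a simplification it uses a Rabin-Karp style polynomial rolling hash modulo a prime rather than a true Rabin fingerprint over GF(2), and the function name, base, and modulus are assumptions.

```python
import random

def cdc_chunks(data, window=48, min_size=2048, avg_size=8192, max_size=65536):
    """Split data into content-defined chunks.

    Assumes avg_size is a power of two and min_size >= window, so that every
    boundary decision is taken on a full window.
    """
    base, mod = 257, (1 << 61) - 1       # hypothetical hash parameters
    mask = avg_size - 1                  # boundary when the low bits are all ones
    out_weight = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:
            # drop the oldest byte of the window before adding the new one
            h = (h - data[i - window] * out_weight) % mod
        h = (h * base + byte) % mod
        length = i - start + 1
        if (length >= min_size and (h & mask) == mask) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])      # trailing chunk, may be shorter than min_size
    return chunks

# Small parameters so that a short random buffer yields many chunks.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(50_000))
small = dict(min_size=64, avg_size=256, max_size=1024)
a = cdc_chunks(data, **small)
b = cdc_chunks(b"0123456789" + data, **small)  # same data, shifted by 10 bytes
shared = set(a) & set(b)
print(len(a), len(b), len(shared))
```

Because boundaries depend only on the bytes inside the window, inserting a few bytes at the front changes the first chunk or two and leaves most later chunks identical, which is what makes deduplication across similar contents possible.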

test 1

Dataset: linux.git

Rabin fingerprint parameters:

prime: 3
window_size: 48 KB
chunk size (min/avg/max): 2 KB / 8 KB / 64 KB

Results:

average chunk size (effective): 9.37 KB
dedup chunk size (uncompressed): 19.87 GB (35.55%)
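
The percentage is the deduplicated chunk size relative to the original uncompressed size of the dataset (19.87 GB / 55.89 GB). A minimal sketch of how such figures can be computed, assuming chunks are identified by a hash of their bytes; the use of SHA1 and the function name are assumptions, not taken from swh-dedup-blocks.py:

```python
import hashlib

def dedup_stats(contents_chunks):
    """Return (total size, deduplicated size, number of unique chunks).

    contents_chunks is an iterable of chunk lists, one list per content.
    """
    total = 0
    unique = {}  # chunk digest -> chunk length
    for chunks in contents_chunks:
        for chunk in chunks:
            total += len(chunk)
            unique.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
    return total, sum(unique.values()), len(unique)

# Toy example: two contents sharing one chunk.
stats = dedup_stats([[b"aaaa", b"bb"], [b"aaaa", b"cc"]])
print(stats)  # (12, 8, 3)

# The ratio reported for test 1, recomputed from the rounded sizes.
ratio_test1 = round(19.87 / 55.89 * 100, 2)
print(ratio_test1)  # 35.55
```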

test 2

Dataset: linux.git

Rabin fingerprint parameters:

prime: 3
window_size: 48 KB
chunk size (min/avg/max): 512 B / 2 KB / 8 KB

Results:

average chunk size (effective): 2.86 KB
dedup chunk size (uncompressed): 9.09 GB (16.26%)

test 3

Dataset: linux.git

Rabin fingerprint parameters:

prime: 3
window_size: 48 KB
chunk size (min/avg/max): 512 B / 1 KB / 8 KB

Results:

average chunk size (effective): 1.72 KB
dedup chunk size (uncompressed): 6.49 GB (11.60%)

test 4

TODO

contents: 4111582
chunks: 164052312
average chunk size: 1597.60
total content size: 308154402601
total chunk size: 262090652269 (85.05%)

real 22m20,334s
user 15m56,586s
sys 2m49,606s
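
The derived figures in the output above follow from the raw counts: the average chunk size is the total chunk size divided by the number of chunks, and the percentage is the total chunk size relative to the total content size. A quick check, assuming the sizes are in bytes (the variable names are only illustrative):

```python
contents = 4_111_582
chunks = 164_052_312
total_content_size = 308_154_402_601
total_chunk_size = 262_090_652_269

avg_chunk_size = total_chunk_size / chunks
ratio = 100 * total_chunk_size / total_content_size

print(f"{avg_chunk_size:.2f}")  # 1597.60
print(f"{ratio:.2f}%")          # 85.05%
print(f"{total_content_size / contents:.0f} bytes per content on average")
```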
