Statistics/Content size

From Software Heritage Wiki
Revision as of 15:13, 19 October 2017 by StefanoZacchiroli (talk | contribs) (→‎Reproducibility)

Results

Last run on 2017-10-19 12:38:24 +0000, querying the replica DB on somerset.internal.softwareheritage.org.

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
2.000e+00 1.052e+03 3.387e+03 8.011e+04 1.247e+04 1.314e+09

All values are lengths, in bytes, of uncompressed contents.

Sample size: 3,837,249 contents.
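Because the sample is a 0.1% Bernoulli sample (see Reproducibility), the sample size and mean can be scaled up to rough archive-wide figures. The sketch below only does that arithmetic; `estimate_population` and `estimate_total_bytes` are hypothetical helper names, and the inputs are the figures reported above.

```shell
# Hypothetical helpers (not part of the original pipeline).

# Estimated number of rows in the table.
# $1 = sample size, $2 = sampling probability (0.001 for BERNOULLI (0.1))
estimate_population() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.0f\n", n / p }'
}

# Estimated total volume in bytes.
# $1 = estimated number of contents, $2 = mean content length in bytes
estimate_total_bytes() {
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.3e\n", n * m }'
}

contents=$(estimate_population 3837249 0.001)
echo "estimated contents: $contents"
echo "estimated total bytes: $(estimate_total_bytes "$contents" 80110)"
```

This gives roughly 3.8 billion contents. The total-volume figure should be read with caution: the distribution is heavy-tailed (the maximum is about five orders of magnitude above the 3rd quartile), so the sample mean is sensitive to which large contents happened to be drawn.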

Figure (Content sizes.png): histogram of log10(content size in bytes).

Reproducibility

SQL (content-size-1pct.sql; despite the file name, the query samples 0.1% of rows, not 1%):

SELECT length
FROM content TABLESAMPLE BERNOULLI (0.1);  -- 0.1% of known blobs
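In PostgreSQL, the argument to TABLESAMPLE BERNOULLI is a percentage, and each row is kept independently with that probability, so the sample size varies slightly from run to run. A minimal client-side sketch of the same idea, assuming one value per line on standard input; `bernoulli_sample` is a hypothetical helper and is not how the figures above were produced.

```shell
# Hypothetical helper (not part of the original pipeline).
# Keep each input line independently with probability $1 (0.001 = 0.1%).
# Optional $2 seeds awk's random generator so a run can be repeated.
bernoulli_sample() {
    awk -v p="$1" -v seed="${2:-$(date +%s)}" '
        BEGIN { srand(seed) }
        rand() < p    # print the line when the draw falls below p
    '
}

# Example: sample about 0.1% of 100000 integers and count the survivors.
seq 1 100000 | bernoulli_sample 0.001 | wc -l
```

The count printed by the example hovers around 100 rather than being exactly 100, which is the same behaviour the SQL query shows on the real table.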

Shell:

psql service=swh-replica -At -f content-size-1pct.sql | gzip -9c > content-size-1pct.txt.gz

R:

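# Read the sampled lengths (one integer per line) from the gzipped dump
sizes <- scan(gzfile('content-size-1pct.txt.gz'))

length(sizes)   # sample size
summary(sizes)  # min, quartiles, mean, max (in bytes)

# Histogram of log10(size); sizes span nearly nine orders of magnitude
png('content_sizes.png')
hist(log10(sizes))
dev.off()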
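The summary can also be cross-checked without R. The sketch below is one possible way to do it, assuming one integer length per line on standard input; `size_summary` is a hypothetical helper. It uses linear interpolation between order statistics, which is the default quantile definition (type 7) used by R's summary().

```shell
# Hypothetical helper (not part of the original pipeline).
# Prints min, quartiles, mean and max of the numbers read from stdin.
size_summary() {
    sort -n | awk '
        # Quantile with linear interpolation (R type 7) over the sorted array v.
        function q(p,    h, lo) {
            h = (NR - 1) * p + 1
            lo = int(h)
            if (lo >= NR) return v[NR]
            return v[lo] + (h - lo) * (v[lo + 1] - v[lo])
        }
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1    # no input: nothing to summarise
            printf "min=%g q1=%g median=%g mean=%g q3=%g max=%g\n",
                   v[1], q(0.25), q(0.5), sum / NR, q(0.75), v[NR]
        }'
}

# Small self-contained example.
printf '1\n2\n3\n4\n' | size_summary
```

On the real data it would be run as `gzip -dc content-size-1pct.txt.gz | size_summary`. Note that `%g` prints about six significant digits, whereas R's summary() output above shows four.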