Statistics/Content size
Latest revision as of 15:13, 19 October 2017
Results
Last run on 2017-10-19 12:38:24 +0000, querying the replica DB on somerset.internal.softwareheritage.org.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 2.000e+00 | 1.052e+03 | 3.387e+03 | 8.011e+04 | 1.247e+04 | 1.314e+09 |
Values are lengths in bytes of the uncompressed contents.
Sample size: 3,837,249 contents.
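As a sanity check, the sample size can be recomputed from the dump file and scaled by the sampling rate to get a rough idea of how many contents the table held. The sketch below is not part of the original workflow: `estimate_total` is a hypothetical helper name, and it assumes the file holds one length per line, as produced by `psql -At`. Because BERNOULLI sampling selects each row independently, the scaled figure is only an estimate.

```shell
# estimate_total FILE RATE_PERCENT
# Hypothetical helper: counts the sampled lengths in a gzipped file and
# scales the count by the sampling rate (given in percent).
estimate_total() {
    gzip -dc < "$1" | awk -v pct="$2" '
        END { printf "%d sampled, ~%.0f estimated total\n", NR, NR * 100 / pct }
    '
}
```

Usage: `estimate_total content-size-1pct.txt.gz 0.1`. With 3,837,249 sampled rows at 0.1%, the estimate comes to roughly 3.8 billion contents.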
Reproducibility
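SQL (saved as content-size-1pct.sql; despite the "1pct" in the file names, the query samples 0.1% of rows):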
SELECT length FROM content TABLESAMPLE BERNOULLI (0.1); -- 0.1% of known blobs
Shell:
psql service=swh-replica -At -f content-size-1pct.sql | gzip -9c > content-size-1pct.txt.gz
R:
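# Read the sampled lengths (one integer per line) from the gzipped psql output
sizes <- scan(gzfile('content-size-1pct.txt.gz'))
length(sizes)    # sample size
summary(sizes)   # min, quartiles, median, mean, max
# Plot on a log10 scale, since sizes span several orders of magnitude
png('content_sizes.png')
hist(log10(sizes))
dev.off()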
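For a quick look at the distribution without R, a coarse text version of the log-scale histogram can be produced directly from the dump. This is a sketch under assumptions, not the method used for the results above: `log10_buckets` is a hypothetical helper, it assumes one non-negative integer per line, and it bins by whole decades (floor of log10), which is coarser than the default bins of `hist()`. It counts digits rather than calling `log()`, to avoid floating-point rounding at exact powers of ten.

```shell
# log10_buckets FILE
# Hypothetical helper: prints "decade count" pairs, where decade is
# floor(log10(length)). Zero-length contents are skipped, as log10(0)
# is undefined.
log10_buckets() {
    gzip -dc < "$1" | awk '
        $1 > 0 { count[length($1) - 1]++ }   # digits - 1 == floor(log10(n))
        END    { for (d in count) print d, count[d] }
    ' | sort -n
}
```

Usage: `log10_buckets content-size-1pct.txt.gz`. Each output line gives a decade and the number of sampled contents whose length falls in it, for example decade 3 covers lengths from 1,000 to 9,999 bytes.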
