Difference between revisions of "Statistics/Content size"

From Software Heritage Wiki

Latest revision as of 15:13, 19 October 2017

Results

Last run on 2017-10-19 12:38:24 +0000, querying the replica DB on somerset.internal.softwareheritage.org.

Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
2.000e+00  1.052e+03  3.387e+03  8.011e+04  1.247e+04  1.314e+09

All values are content lengths in bytes, measured on uncompressed contents.

Sample size: 3,837,249 contents.

[Figure: Content_sizes.png, histogram of log10 of the sampled content sizes in bytes]

Reproducibility

SQL (saved as content-size-1pct.sql, the file that the shell command below passes to psql):

SELECT length
FROM content TABLESAMPLE BERNOULLI (0.1);  -- 0.1% of known blobs
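
BERNOULLI sampling keeps each row independently with the given percentage as its probability, so the sample size can be scaled back up to a rough estimate of the table size. The sketch below shows that arithmetic; the function name estimate_population is hypothetical, and the result is only an approximation because the sample size itself varies from run to run.

```shell
# Hypothetical helper (not part of the original page): estimate the number
# of rows in the sampled table from a BERNOULLI sample.
#   $1 = number of rows in the sample
#   $2 = sampling percentage passed to TABLESAMPLE BERNOULLI (e.g. 0.1)
estimate_population() {
    awk -v n="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", n * 100 / pct }'
}

# Sample size and percentage reported on this page.
estimate_population 3837249 0.1
```

With the figures reported above, this suggests on the order of 3.8 billion known contents at the time of the run.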

Shell (despite the "1pct" in the file names, the query above samples 0.1% of the rows, not 1%):

psql service=swh-replica -At -f content-size-1pct.sql | gzip -9c > content-size-1pct.txt.gz
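
Before loading the dump into R, it can be useful to sanity-check it from the shell. The following is a minimal sketch, assuming the file holds one integer per line as produced by psql -At; the function name summarize_sizes is hypothetical and not part of the original procedure.

```shell
# Hypothetical helper: print count, min, max and mean of a gzip-compressed
# file that holds one integer per line. Blank lines are skipped.
summarize_sizes() {
    gzip -dc -- "$1" | awk '
        NF == 0 { next }                  # ignore empty lines
        n == 0  { min = $1; max = $1 }    # initialise on the first value
        {
            n++
            sum += $1
            if ($1 < min) min = $1
            if ($1 > max) max = $1
        }
        END {
            if (n == 0) { print "count=0"; exit }
            printf "count=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
        }'
}
```

Usage: summarize_sizes content-size-1pct.txt.gz. The count, minimum, maximum and mean it prints should agree with the output of length(sizes) and summary(sizes) in R for the same file.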

R:

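# Read the sampled lengths (one number per line) directly from the gzip file.
sizes <- scan(gzfile('content-size-1pct.txt.gz'))

length(sizes)   # sample size
summary(sizes)  # min, quartiles, median, mean and max, in bytes

# Histogram of the sizes on a log10 scale, written to a PNG file.
png('content_sizes.png')
hist(log10(sizes))
dev.off()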