Difference between revisions of "Statistics/Content size"

From Software Heritage Wiki

Revision as of 14:54, 19 October 2017

Results

Last run on 2017-10-19 12:38:24 +0000, querying the replica DB on somerset.internal.softwareheritage.org.

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
2.000e+00 1.052e+03 3.387e+03 8.011e+04 1.247e+04 1.314e+09

All values are content lengths in bytes, measured on the uncompressed contents.

Sample size: 3,837,249 contents.


Reproducibility

SQL:

SELECT length
FROM content TABLESAMPLE BERNOULLI (0.1);  -- 0.1% of known blobs
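
In PostgreSQL, TABLESAMPLE BERNOULLI takes a percentage and keeps each row independently with that probability, so the sample size varies slightly from run to run. The following is a minimal Python sketch of that behaviour, not the database's implementation. The names bernoulli_sample and estimate_total and the toy population are assumptions made for illustration.

```python
import random


def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent/100.

    Illustrative stand-in for TABLESAMPLE BERNOULLI (percent);
    the function name and the seed argument are hypothetical.
    """
    rng = random.Random(seed)
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]


def estimate_total(sample_size, percent):
    """Estimate the table's row count from the size of a Bernoulli sample."""
    return sample_size * 100.0 / percent


if __name__ == "__main__":
    # Toy population of 1,000,000 rows (made-up data, not real contents).
    population = range(1_000_000)
    sample = bernoulli_sample(population, 0.1, seed=42)
    print("sampled rows:", len(sample))  # close to 1000, but not exactly 1000

    # Applied to the sample size reported in Results:
    print("estimated contents:", round(estimate_total(3_837_249, 0.1)))
```

By the same arithmetic, the reported sample of 3,837,249 contents at 0.1% suggests a table of roughly 3.8 billion rows at the time of the run. This is an estimate, not a figure stated on this page.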

R, with the sampled lengths loaded into the numeric vector sizes:

summary(sizes)

png('content_sizes.png')
hist(log10(sizes))
dev.off()
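
For readers without R, the same two steps can be approximated with the Python standard library. This is a rough sketch under stated assumptions: summarize and log10_histogram are hypothetical names, the quartiles use the "inclusive" method (which corresponds to R's default quantile type), and the histogram counts whole powers of ten rather than reproducing the bins that hist() would choose.

```python
import math
import statistics
from collections import Counter


def summarize(sizes):
    """Return the six figures that R's summary() prints for a numeric vector."""
    q1, median, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
    return {
        "min": min(sizes),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(sizes),
        "q3": q3,
        "max": max(sizes),
    }


def log10_histogram(sizes):
    """Count sizes per decade, i.e. per value of floor(log10(size)).

    Zero-length contents are skipped because log10(0) is undefined.
    """
    return Counter(math.floor(math.log10(s)) for s in sizes if s > 0)


if __name__ == "__main__":
    # Made-up lengths in bytes, for illustration only.
    sizes = [2, 50, 700, 3000, 3387, 12470, 80000]
    print(summarize(sizes))
    for decade, count in sorted(log10_histogram(sizes).items()):
        print(f"10^{decade} to 10^{decade + 1} bytes: {count}")
```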