<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.softwareheritage.org/index.php?action=history&amp;feed=atom&amp;title=Compression_ratio</id>
	<title>Compression ratio - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.softwareheritage.org/index.php?action=history&amp;feed=atom&amp;title=Compression_ratio"/>
	<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Compression_ratio&amp;action=history"/>
	<updated>2026-04-20T21:48:02Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Compression_ratio&amp;diff=169&amp;oldid=prev</id>
		<title>StefanoZacchiroli: 1 revision: import public pages from the intranet wiki</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Compression_ratio&amp;diff=169&amp;oldid=prev"/>
		<updated>2016-07-20T13:02:34Z</updated>

		<summary type="html">&lt;p&gt;1 revision: import public pages from the intranet wiki&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:02, 20 July 2016&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-notice&quot; lang=&quot;en&quot;&gt;&lt;div class=&quot;mw-diff-empty&quot;&gt;(No difference)&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Compression_ratio&amp;diff=168&amp;oldid=prev</id>
		<title>StefanoZacchiroli: /* Computing compression ratio */</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Compression_ratio&amp;diff=168&amp;oldid=prev"/>
		<updated>2016-01-16T14:18:30Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Computing compression ratio&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Files are stored in the [[Software Heritage]] [[object storage]] individually, and each file is compressed using Gzip independently from other files.&lt;br /&gt;
&lt;br /&gt;
The actual '''compression ratio''' has been experimentally measured at '''45.53%''' (compressed size / uncompressed size * 100), i.e., '''2.20:1'''.&lt;br /&gt;
&lt;br /&gt;
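As a quick check of the arithmetic, the sketch below converts between the two notations, using the totals reported in the results on this page. The function names are illustrative only and are not part of any Software Heritage tooling.&lt;br /&gt;
&lt;br /&gt;
```python
def ratio_percent(comp_size, orig_size):
    """Compressed size as a percentage of the uncompressed size."""
    return comp_size / orig_size * 100


def ratio_x_to_1(comp_size, orig_size):
    """The same ratio expressed as N:1 (uncompressed / compressed)."""
    return orig_size / comp_size


# Totals copied from the experimental results on this page.
ORIG_TOTAL = 106683965488
COMP_TOTAL = 48568294853

if __name__ == "__main__":
    print("%.2f%%" % ratio_percent(COMP_TOTAL, ORIG_TOTAL))
    print("%.2f:1" % ratio_x_to_1(COMP_TOTAL, ORIG_TOTAL))
```
&lt;br /&gt;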
== Experimental evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate the compression ratio, we randomly selected ~1.6M files (~0.1% of the objects in the object storage at the time of writing).&lt;br /&gt;
For each file, we then extracted the original (uncompressed) size from the content table.&lt;br /&gt;
Finally, for each file we computed the compressed size by looking at the actual file on disk with du --bytes.&lt;br /&gt;
&lt;br /&gt;
On the resulting table object_id/compressed_size/original_size, we have computed the ratio sum(compressed_size)/sum(original_size).&lt;br /&gt;
&lt;br /&gt;
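The results report both the ratio of the totals and the average of the per-file ratios. A minimal sketch with made-up sizes (not taken from the sample) suggests why the two can differ: the ratio of totals weights each file by its size, whereas the per-file average gives every file equal weight. The function names and the data are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def total_ratio(rows):
    """sum(compressed_size) / sum(original_size), as a percentage."""
    tot_comp = sum(comp for _orig, comp in rows)
    tot_orig = sum(orig for orig, _comp in rows)
    return tot_comp / tot_orig * 100


def file_avg_ratio(rows):
    """Mean of the per-file ratios, as a percentage."""
    ratios = [comp / orig for orig, comp in rows]
    return sum(ratios) / len(ratios) * 100


# Hypothetical (orig_size, comp_size) pairs: one large file that
# compresses well and one small file that compresses poorly.
rows = [(1000, 250), (100, 75)]

if __name__ == "__main__":
    print("total:    %.2f%%" % total_ratio(rows))
    print("file avg: %.2f%%" % file_avg_ratio(rows))
```
&lt;br /&gt;
With these made-up sizes the large, well-compressed file dominates the ratio of totals, so that figure comes out lower than the per-file average.&lt;br /&gt;
&lt;br /&gt;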
Here are the final results:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
uffizi$ xzcat content-random-0.1pct-id-sizes.txt.xz | ./avg-compression &lt;br /&gt;
&lt;br /&gt;
objects:		1626419&lt;br /&gt;
comp. ratio (file avg):	51.33%&lt;br /&gt;
orig. size (total):	106683965488&lt;br /&gt;
comp. size (total):	48568294853&lt;br /&gt;
comp. ratio (total):	45.53%&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Reproducibility ===&lt;br /&gt;
&lt;br /&gt;
==== Random sample generation ====&lt;br /&gt;
&lt;br /&gt;
To obtain a random sample of (object_id, uncompressed size) rows:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
echo &amp;quot;copy (select sha1, length from content where random() &amp;lt; 0.001) to stdout;&amp;quot; | psql softwareheritage | cut -c 4- | gzip -c &amp;gt; random-sample.txt.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
* the cut is needed to remove the leading '\\x' escape sequence&lt;br /&gt;
* re-running the above query with a larger object storage will likely give ''more'' than 1.6M objects&lt;br /&gt;
* the above query took about 10 minutes&lt;br /&gt;
&lt;br /&gt;
==== Compressed size calculation ====&lt;br /&gt;
&lt;br /&gt;
To compute the compressed size of each object, pipe the list generated in the previous step into:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
uffizi$ cat ls-obj-size    &lt;br /&gt;
#!/bin/bash&lt;br /&gt;
echo -e &amp;quot;# object\torig_size\tcomp_size&amp;quot;&lt;br /&gt;
while read -r obj_id orig_size ; do&lt;br /&gt;
    obj_file=$(&amp;quot;$HOME/bin/swh-ls-obj&amp;quot; &amp;quot;$obj_id&amp;quot;)&lt;br /&gt;
    # orig_size=$(zcat &amp;quot;$obj_file&amp;quot; | wc --bytes)&lt;br /&gt;
    comp_size=$(du --bytes &amp;quot;$obj_file&amp;quot; | cut -f 1)&lt;br /&gt;
    echo -e &amp;quot;${obj_id}\t${orig_size}\t${comp_size}&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The above script depends on the general-purpose utility swh-ls-obj:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
uffizi$ cat bin/swh-ls-obj &lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
OBJ_ROOT=&amp;quot;/srv/softwareheritage/objects&amp;quot;&lt;br /&gt;
&lt;br /&gt;
if [ -z &amp;quot;$1&amp;quot; ] ; then&lt;br /&gt;
    echo &amp;quot;Usage: swh-ls-obj OBJECT_ID [LS_OPTION]...&amp;quot;&lt;br /&gt;
    exit 1&lt;br /&gt;
fi&lt;br /&gt;
obj_id=&amp;quot;$1&amp;quot;&lt;br /&gt;
shift&lt;br /&gt;
&lt;br /&gt;
ls &amp;quot;$@&amp;quot; &amp;quot;${OBJ_ROOT}/${obj_id:0:2}/${obj_id:2:2}/${obj_id:4:2}/${obj_id}&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Running ls-obj-size on the ~1.6M sampled objects took a bit less than 2 days.&lt;br /&gt;
&lt;br /&gt;
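For illustration, the path layout used by swh-ls-obj and the size lookup done by ls-obj-size can be sketched in Python as follows. This is a hypothetical alternative, not the tooling used for the evaluation; the function names are assumptions, and os.path.getsize is used on the assumption that it reports the same apparent size as du --bytes for a regular file.&lt;br /&gt;
&lt;br /&gt;
```python
import os

# Assumed to match OBJ_ROOT in the swh-ls-obj script on this page.
OBJ_ROOT = "/srv/softwareheritage/objects"


def obj_path(obj_id, root=OBJ_ROOT):
    """Map a hex object id to its sharded path: root/ab/cd/ef/abcdef..."""
    return os.path.join(root, obj_id[0:2], obj_id[2:4], obj_id[4:6], obj_id)


def comp_size(obj_id, root=OBJ_ROOT):
    """Size in bytes of the compressed object file on disk."""
    return os.path.getsize(obj_path(obj_id, root))
```
&lt;br /&gt;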
==== Compression ratio calculation ====&lt;br /&gt;
&lt;br /&gt;
Pipe the output of ls-obj-size to:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
uffizi$ cat avg-compression &lt;br /&gt;
#!/usr/bin/python3&lt;br /&gt;
&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
comp_ratios = []&lt;br /&gt;
tot_orig_size = 0&lt;br /&gt;
tot_comp_size = 0&lt;br /&gt;
&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    if line.startswith('#'):&lt;br /&gt;
        continue&lt;br /&gt;
    (_obj_id, orig_size, comp_size) = line.split()&lt;br /&gt;
    (orig_size, comp_size) = (int(orig_size), int(comp_size))&lt;br /&gt;
&lt;br /&gt;
    tot_orig_size += orig_size&lt;br /&gt;
    tot_comp_size += comp_size&lt;br /&gt;
    comp_ratios.append(comp_size / orig_size)&lt;br /&gt;
&lt;br /&gt;
tot_comp_ratio = tot_comp_size / tot_orig_size * 100&lt;br /&gt;
avg_comp_ratio = sum(comp_ratios) / len(comp_ratios) * 100&lt;br /&gt;
&lt;br /&gt;
print('objects:\t\t%d' % len(comp_ratios))&lt;br /&gt;
print('comp. ratio (file avg):\t%.2f%%' % avg_comp_ratio)&lt;br /&gt;
print('orig. size (total):\t%d' % tot_orig_size)&lt;br /&gt;
print('comp. size (total):\t%d' % tot_comp_size)&lt;br /&gt;
print('comp. ratio (total):\t%.2f%%' % tot_comp_ratio)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
</feed>