Compression ratio
Files are stored individually in the Software Heritage object storage, and each file is gzip-compressed independently of the others.
The actual compression ratio has been experimentally evaluated at 45.53% (compressed size / uncompressed size * 100), i.e., a 2.20:1 ratio.
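Per-file gzip compression can be sketched in Python as follows. This is an illustration only, not the actual Software Heritage storage code; the file content and the default compression level are made up for the example:

```python
import gzip

# Toy file content -- highly redundant, so it compresses well.
data = b"hello software heritage\n" * 100

# Each object is compressed on its own, with no cross-file dictionary,
# so the ratio depends only on the file's own redundancy.
compressed = gzip.compress(data)
ratio = len(compressed) / len(data) * 100
print('comp. ratio: %.2f%%' % ratio)
```

Independent compression keeps reads simple (one object, one gzip stream) at the cost of missing cross-file redundancy.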
Experimental evaluation
To evaluate the compression ratio we have randomly selected ~1.6 M files (~0.1% of the object storage at the time of writing). For each file we have then extracted the original (uncompressed) size from the content table. Finally, for each file we have computed the compressed size by looking at the actual file on disk with du --bytes.
On the resulting table object_id/compressed_size/original_size, we have computed the ratio sum(compressed_size)/sum(original_size).
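Note that this total ratio, weighted by file size, differs from the unweighted per-file average (51.33% vs. 45.53% in the results below). A minimal sketch of the two computations, with made-up sizes:

```python
# (compressed_size, original_size) pairs -- illustrative numbers only
sizes = [(50, 100), (400, 1000), (30, 40)]

# Total ratio: sums first, so large files weigh more.
total = sum(c for c, _ in sizes) / sum(o for _, o in sizes) * 100

# File average: every file counts the same regardless of size.
file_avg = sum(c / o for c, o in sizes) / len(sizes) * 100

print('comp. ratio (total):    %.2f%%' % total)
print('comp. ratio (file avg): %.2f%%' % file_avg)
```

The total ratio is the relevant figure for estimating disk usage, since it answers "how many compressed bytes per uncompressed byte" over the whole sample.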
Here are the final results:
```
uffizi$ xzcat content-random-0.1pct-id-sizes.txt.xz | ./avg-compression
objects:                1626419
comp. ratio (file avg): 51.33%
orig. size (total):     106683965488
comp. size (total):     48568294853
comp. ratio (total):    45.53%
```
Reproducibility
Random sample generation
To obtain a random sample of (object_id, uncompressed size) rows:
```
echo "copy (select sha1, length from content where random() < 0.001) to stdout;" \
    | psql softwareheritage | cut -c 4- | gzip -c > random-sample.txt.gz
```
Notes:
- the cut is needed to remove the leading '\x' escape sequence
- re-running the above query with a larger object storage will likely give more than 1.6M objects
- the above took ~10 minutes
Compressed size calculation
To compute the compressed size of each object, pipe in the list generated in the previous step:
```
uffizi$ cat ls-obj-size
#!/bin/bash

echo -e "# object\torig_size\tcomp_size"
while read obj_id orig_size ; do
    obj_file=$($HOME/bin/swh-ls-obj $obj_id)
    # orig_size=$(zcat $obj_file | wc --bytes)
    comp_size=$(du --bytes $obj_file | cut -f 1)
    echo -e "${obj_id}\t${orig_size}\t${comp_size}"
done
```
A dependency of the above is the general-purpose utility swh-ls-obj:
```
uffizi$ cat bin/swh-ls-obj
#!/bin/bash

OBJ_ROOT="/srv/softwareheritage/objects"

if [ -z "$1" ] ; then
    echo "Usage: swh-ls-obj OBJECT_ID [LS_OPTION]..."
    exit 1
fi
obj_id="$1"
shift

ls "$@" "${OBJ_ROOT}/${obj_id:0:2}/${obj_id:2:2}/${obj_id:4:2}/${obj_id}"
```
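The sharding scheme used by swh-ls-obj (the first three byte pairs of the hex object id become nested directories) can be reproduced in Python. This is a sketch for illustration; OBJ_ROOT mirrors the value hard-coded in the script above:

```python
OBJ_ROOT = '/srv/softwareheritage/objects'

def obj_path(obj_id: str) -> str:
    """Return the on-disk path of an object, given its hex id,
    using the ab/cd/ef/<full id> sharding layout."""
    return '%s/%s/%s/%s/%s' % (
        OBJ_ROOT, obj_id[0:2], obj_id[2:4], obj_id[4:6], obj_id)

print(obj_path('abcdef0123456789abcdef0123456789abcdef01'))
# -> /srv/softwareheritage/objects/ab/cd/ef/abcdef0123456789abcdef0123456789abcdef01
```

Sharding on id prefixes keeps any single directory from accumulating billions of entries, which most filesystems handle poorly.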
Running ls-obj-size on the ~1.6 M sample took a bit less than 2 days.
Compression ratio calculation
Pipe the output of ls-obj-size into:
```
uffizi$ cat avg-compression
#!/usr/bin/python3

import sys

comp_ratios = []
tot_orig_size = 0
tot_comp_size = 0

for line in sys.stdin:
    if line.startswith('#'):
        continue
    (_obj_id, orig_size, comp_size) = line.split()
    (orig_size, comp_size) = (int(orig_size), int(comp_size))
    tot_orig_size += orig_size
    tot_comp_size += comp_size
    comp_ratios.append(comp_size / orig_size)

tot_comp_ratio = tot_comp_size / tot_orig_size * 100
avg_comp_ratio = sum(comp_ratios) / len(comp_ratios) * 100

print('objects:\t\t%d' % len(comp_ratios))
print('comp. ratio (file avg):\t%.2f%%' % avg_comp_ratio)
print('orig. size (total):\t%d' % tot_orig_size)
print('comp. size (total):\t%d' % tot_comp_size)
print('comp. ratio (total):\t%.2f%%' % tot_comp_ratio)
```