Compression ratio

From Software Heritage Wiki

Files are stored in the Software Heritage object storage individually, and each file is compressed using Gzip independently from other files.
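
For illustration only, here is a minimal sketch of what storing one such object could look like. This is an assumption for clarity, not the actual Software Heritage loader code; the sharded directory layout mirrors the swh-ls-obj helper shown further down.

# Sketch only: gzip one file into an object named after the SHA1 of its content,
# using the same xx/yy/zz directory sharding as swh-ls-obj below.
f=some_file                               # hypothetical input file
obj_id=$(sha1sum "$f" | cut -d ' ' -f 1)  # object id: SHA1 of the (uncompressed) content
dir="objects/${obj_id:0:2}/${obj_id:2:2}/${obj_id:4:2}"
mkdir -p "$dir"
gzip -c "$f" > "${dir}/${obj_id}"         # each object is compressed independently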

The actual compression ratio has been experimentally evaluated at 45.53% (compressed size / uncompressed size * 100), i.e., about 2.20:1 (uncompressed:compressed).

Experimental evaluation

To evaluate the compression ratio we have randomly selected ~1.6 M files (~0.1% of the object storage at the time of writing). For each file we have then extracted the original (uncompressed) size from the content table. Finally, for each file we have computed the compressed size by looking at the actual file on disk with du --bytes.

On the resulting object_id/compressed_size/original_size table we have then computed the ratio sum(compressed_size)/sum(original_size).

Here are the final results:

uffizi$ xzcat content-random-0.1pct-id-sizes.txt.xz | ./avg-compression 

objects:		1626419
comp. ratio (file avg):	51.33%
orig. size (total):	106683965488
comp. size (total):	48568294853
comp. ratio (total):	45.53%
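
Note that the two ratios differ because the total one weights each file by its size, while the file average is an unweighted mean of per-file ratios. As a sanity check, the headline figures can be recomputed directly from the totals above (the second command gives the 2.20:1 figure):

uffizi$ python3 -c 'print("%.2f" % (48568294853 / 106683965488 * 100))'
45.53
uffizi$ python3 -c 'print("%.2f" % (106683965488 / 48568294853))'
2.20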

Reproducibility

Random sample generation

To obtain a random sample of object_id, (uncompressed) size rows:

echo "copy (select sha1, length from content where random() < 0.001) to stdout;" | psql softwareheritage | cut -c 4- | gzip -c > random-sample.txt.gz

Notes:

  • the cut is needed to remove the leading '\\x' escape sequence
  • re-running the above query with a larger object storage will likely give more than 1.6M objects
  • the above took about 10 minutes

Compressed size calculation

To compute the compressed size of each object, pipe the list generated in the previous step into the following script:

uffizi$ cat ls-obj-size    
#!/bin/bash
# Read "object_id original_size" pairs on stdin and emit one TSV row per object,
# adding the compressed (on-disk) size obtained with du --bytes.
echo -e "# object\torig_size\tcomp_size"
while read obj_id orig_size ; do
    obj_file=$($HOME/bin/swh-ls-obj $obj_id)
    # orig_size=$(zcat $obj_file | wc --bytes)
    comp_size=$(du --bytes $obj_file | cut -f 1)
    echo -e "${obj_id}\t${orig_size}\t${comp_size}"
done

A dependency of the above is the general utility swh-ls-obj:

uffizi$ cat bin/swh-ls-obj 
#!/bin/bash

OBJ_ROOT="/srv/softwareheritage/objects"

if [ -z "$1" ] ; then
    echo "Usage: swh-ls-obj OBJECT_ID [LS_OPTION]..."
    exit 1
fi
obj_id="$1"
shift

ls "$@" "${OBJ_ROOT}/${obj_id:0:2}/${obj_id:2:2}/${obj_id:4:2}/${obj_id}"

Running ls-obj-size on the ~1.6M sampled objects took a bit less than 2 days.
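
Tying the two previous steps together, the run presumably looked something like the following; the exact command line is a reconstruction (only the file names appear elsewhere on this page):

uffizi$ zcat random-sample.txt.gz | ./ls-obj-size | xz -c > content-random-0.1pct-id-sizes.txt.xz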

Compression ratio calculation

Pipe the output of ls-obj-size to:

uffizi$ cat avg-compression 
#!/usr/bin/python3

import sys

comp_ratios = []
tot_orig_size = 0
tot_comp_size = 0

for line in sys.stdin:
    if line.startswith('#'):
        continue
    (_obj_id, orig_size, comp_size) = line.split()
    (orig_size, comp_size) = (int(orig_size), int(comp_size))

    tot_orig_size += orig_size
    tot_comp_size += comp_size
    comp_ratios.append(comp_size / orig_size)

# aggregate ratio: total compressed size over total original size (size-weighted)
tot_comp_ratio = tot_comp_size / tot_orig_size * 100
# unweighted mean of the per-file compression ratios
avg_comp_ratio = sum(comp_ratios) / len(comp_ratios) * 100

print('objects:\t\t%d' % len(comp_ratios))
print('comp. ratio (file avg):\t%.2f%%' % avg_comp_ratio)
print('orig. size (total):\t%d' % tot_orig_size)
print('comp. size (total):\t%d' % tot_comp_size)
print('comp. ratio (total):\t%.2f%%' % tot_comp_ratio)