Content archiver blueprint

From Software Heritage Wiki
Revision as of 14:39, 9 May 2016 by Qcampos (talk) (Requirements)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Warning: this blueprint has been approved and checked-in the swh-storage Git repository. The current version of it is available via Git as doc/

Software Heritage Archiver

The Software Heritage (SWH) Archiver is responsible for backing up SWH objects as to reduce the risk of losing them.

Currently, the archiver only deals with content objects (i.e., those referenced by the content table in the DB and stored in the SWH object storage). The database itself is lively replicated by other means.


  • Master/slave architecture
    There is 1 master copy and 1 or more slave copies. There is a retention policy that defines the minimum number of needed copies of each object to stay on the safe side.
  • Append-only archival
    The archiver treat master as read-only storage. The archiver writes to slave storages append-only, never deleting any previously archived object. If removals are needed, in either master or slave, they will be dealt with by means other than the archiver.
  • Asynchronous archival.
    Periodically (e.g., via cron), the archiver kicks in, produces a list of the objects that need to be copied from master to slaves, and starts copying objects as needed. Very likely, during any given archival run other objects that need replication will be added. It will not be up to that archival run to replicate them, but of future runs.
  • Integrity at archival time.
    Before copying objects from master to slaves, the archiver performs integrity checks on the objects that are in need of replication. For instance, content objects are verified to ensure that they can be decompressed and that their content match their (checksum-based) unique identifiers. Corrupt objects will not be archived and suitable errors reporting about the corruption will be emitted.
    Note that archival-time integrity checks are not meant to replace periodic integrity-checks on all master/slaves; they are still needed!
  • Parallel archival
    Once the list of objects to be archived in a given run has been identified, it SHOULD be possible to archive them in parallel w.r.t. one another.
  • Persistent archival status
    The archiver maintains a mapping between objects and the locations where they are stored. Locations are the set {master, slave_1, ..., slave_n}.
    Each object is also associated to the following information:
    • status: 3-state: missing (copy not present at destination), ongoing (copy to destination ongoing), present (copy present at destination)
    • mtime: timestamp of last status change. In practice this is either the destination archival time (status=present), or the timestamp of the last archival request (status=ongoing)


The archiver is comprised of the following software components:

  • archiver director
  • archiver worker
  • archiver copier

Archiver director

The archiver director is run periodically, e.g., via cron.

Runtime parameters:

  • execution periodicity (external)
  • retention policy
  • archival max age
  • archival batch size

At each execution the director:

  1. for each object: retrieve its archival status
  2. for each object that is in the master storage but has less copies than requested by retention policy:
    1. if status=ongoing and mtime is not older than archival max age
      then continue to next object
    2. check object integrity (e.g., with
    3. mark object as needing archival
  3. group objects in need of archival in batches of archival batch size
  4. for each batch:
    1. set status=ongoing and mtime=now() on each object in the batch
    2. spawn an archive worker on the whole batch (e.g., submitting the relevant celery task)

Note that if an archiver worker task takes a long time (t > archival max age) to complete it is possible for another task to be scheduled on the same batch, or an overlapping one.

Archiver worker

The archiver is executed on demand (e.g., by a celery worker) to archive a given set of objects.

Runtime parameters:

  • objects to archive

At each execution a worker:

  1. create empty map { destinations -> objects that need to be copied there }
  2. for each object to archive:
    1. retrieve current archive status
    2. update the map noting where the object need to be copied
  3. for each destination:
    1. look up in the map objects that need to be copied there
    2. copy all objects to destination using the copier
    3. set status=present and mtime=now() on all copied objects

Note that:

  • In case multiple jobs where tasked to archive the same of overlapping objects, step (2.2) might decide that some/all objects of this batch no longer need to be archived to some/all destinations.
  • Due to parallelism, it is also possible that the same objects will be copied over at the same time by multiple workers.

Archiver copier

The copier is run on demand by archiver workers, to transfer a bunch of files from the master to a given destination.

The copier transfers all file together with a single network connection. The copying process is atomic at the file granularity (i.e., individual files might be visible on the destination before all files have been transferred) and ensures that concurrent transfer of the same files by multiple copier instances do not result in corrupted files. Note that, due to this and the fact that timestamps are updated by the director, all files copied in the same batch will have the same mtime even though the actual file creation times on a given destination might differ.

As a first approximation, the copier can be implemented using rsync, but a dedicated protocol can be devised in the future. In the case of rsync, we should use --files-from to list the file to be copied. We observe that rsync atomically rename files one-by-one during transfer; so as long as --inplace is not used, concurrent rsync of the same files should not be a problem.

DB structure

Postgres SQL definitions for the archival status:


CREATE TABLE archives (
  id   archive_id PRIMARY KEY,
  url  TEXT

CREATE TYPE archive_status AS ENUM (

CREATE TABLE content_archive (
  content_id  sha1 REFERENCES content(sha1),
  archive_id  archive_id REFERENCES archives(id),
  status      archive_status,
  mtime       timestamptz,
  PRIMARY KEY (content_id, archive_id)