This blueprint has been approved and checked into the swh-storage Git repository. The current version of it is available via Git as [https://forge.softwareheritage.org/diffusion/DSTO/browse/master/docs/archiver-blueprint.md archiver-blueprint.md].
  
= Software Heritage Archiver =
See the history of this wiki page for older versions of the spec.
 
 
The Software Heritage (SWH) Archiver is responsible for backing up SWH objects so as to reduce the risk of losing them.
 
 
 
Currently, the archiver only deals with content objects (i.e., those referenced by the content table in the DB and stored in the SWH object storage). The database itself is replicated live by other means.
 
 
 
== Requirements ==
 
 
 
* '''Master/slave architecture'''<br> There is 1 master copy and 1 or more slave copies. A retention policy defines the minimum number of copies of each object needed to stay on the safe side.
 
* '''Append-only archival'''<br> The archiver treats the master as read-only storage. The archiver writes to slave storages in append-only mode, never deleting any previously archived object. If removals are needed, in either master or slaves, they will be dealt with by means other than the archiver.
 
* '''Asynchronous archival.'''<br> Periodically (e.g., via cron), the archiver kicks in, produces a list of the objects that need to be copied from master to slaves, and starts copying objects as needed. Very likely, during any given archival run, other objects in need of replication will be added. It will ''not'' be up to that archival run to replicate them, but to future runs.
 
* '''Integrity at archival time.'''<br> Before copying objects from master to slaves, the archiver performs integrity checks on the objects that are in need of replication. For instance, content objects are verified to ensure that they can be decompressed and that their content matches their (checksum-based) unique identifiers. Corrupt objects will not be archived, and suitable error reports about the corruption will be emitted.<br> Note that archival-time integrity checks are not meant to replace periodic integrity checks on all master/slave copies; those are still needed!
 
* '''Parallel archival'''<br> Once the list of objects to be archived in a given run has been identified, it SHOULD be possible to archive them in parallel w.r.t. one another.
 
* '''Persistent archival status'''<br> The archiver maintains a mapping between objects and the locations where they are stored. Locations are the set {master, slave_1, ..., slave_n}.<br> Each object is also associated with the following information (a sketch of this record follows the list):
 
** '''status''': 3-state: missing (copy not present at destination), ongoing (copy to destination ongoing), present (copy present at destination)
 
** '''mtime''': timestamp of last status change. In practice this is either the destination archival time (status=present), or the timestamp of the last archival request (status=ongoing)
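
For illustration, here is a minimal Python sketch of the per-object record described above; names and types are assumptions, not actual swh-storage code:

<pre>
# Hypothetical sketch of the per-object archival status record described
# above; names and types are illustrative, not actual swh-storage code.
import datetime
import enum


class ArchiveStatus(enum.Enum):
    MISSING = 'missing'   # copy not present at destination
    ONGOING = 'ongoing'   # copy to destination ongoing
    PRESENT = 'present'   # copy present at destination


class ContentArchiveEntry:
    """Archival status of one content object at one destination."""

    def __init__(self, content_id, archive_id, status, mtime):
        self.content_id = content_id    # checksum-based id of the content object
        self.archive_id = archive_id    # e.g. 'master', 'slave_1', ...
        self.status = status            # an ArchiveStatus value
        self.mtime = mtime              # timestamp of last status change

    def touch(self, status):
        """Record a status change, updating mtime accordingly."""
        self.status = status
        self.mtime = datetime.datetime.now(datetime.timezone.utc)
</pre>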
 
 
 
== Architecture ==
 
 
 
The archiver is composed of the following software components:
 
 
 
* archiver director
 
* archiver worker
 
* archiver copier
 
 
 
=== Archiver director ===
 
 
 
The archiver director is run periodically, e.g., via cron.
 
 
 
Runtime parameters:
 
 
 
* execution periodicity (external)
 
* retention policy
 
* archival max age
 
* archival batch size
 
 
 
At each execution the director:
 
 
 
# for each object: retrieve its archival status
 
# for each object that is in the master storage but has fewer copies than required by the retention policy:
 
## if status=ongoing and mtime is not older than archival max age<br /> then continue to next object
 
## check object integrity (e.g., with swh.storage.ObjStorage.check(obj_id))
 
## mark object as needing archival
 
# group objects in need of archival into batches of archival batch size
 
# for each batch:
 
## set status=ongoing and mtime=now() on each object in the batch
 
## spawn an archive worker on the whole batch (e.g., submitting the relevant celery task)
 
 
 
Note that if an archiver worker task takes a long time (t > archival max age) to complete, it is possible for another task to be scheduled on the same batch, or an overlapping one.
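
For illustration, a minimal Python sketch of the director loop described above; all helper names (get_archival_status, check_integrity, mark_ongoing, submit_archive_worker) are assumptions, not actual swh-storage APIs:

<pre>
# Hypothetical sketch of the archiver director loop; helpers are passed in
# as parameters and are assumptions, not actual swh-storage APIs.
import datetime

RETENTION_POLICY = 2                             # minimum number of copies
ARCHIVAL_MAX_AGE = datetime.timedelta(hours=24)  # max age of an ongoing copy
ARCHIVAL_BATCH_SIZE = 1000                       # objects per worker task


def run_director(get_archival_status, check_integrity, mark_ongoing,
                 submit_archive_worker):
    now = datetime.datetime.now(datetime.timezone.utc)
    to_archive = []

    for obj_id, copies in get_archival_status():            # 1. per-object status
        present = [c for c in copies if c.status == 'present']
        ongoing = [c for c in copies if c.status == 'ongoing']
        if len(present) >= RETENTION_POLICY:                 # enough copies already
            continue
        if any(now - c.mtime <= ARCHIVAL_MAX_AGE for c in ongoing):
            continue                                         # 2.1 recent copy ongoing
        if not check_integrity(obj_id):                      # 2.2 integrity check
            # corrupt object: report it elsewhere and do not archive it
            continue
        to_archive.append(obj_id)                            # 2.3 needs archival

    # 3./4. group into batches and spawn one worker per batch
    for i in range(0, len(to_archive), ARCHIVAL_BATCH_SIZE):
        batch = to_archive[i:i + ARCHIVAL_BATCH_SIZE]
        mark_ongoing(batch, mtime=now)                       # status=ongoing, mtime=now()
        submit_archive_worker(batch)                         # e.g. a Celery task
</pre>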
 
 
 
=== Archiver worker ===
 
 
 
The archiver worker is executed on demand (e.g., by a celery worker) to archive a given set of objects.
 
 
 
Runtime parameters:
 
 
 
* objects to archive
 
 
 
At each execution a worker:
 
 
 
# create empty map { destination -> objects that need to be copied there }
 
# for each object to archive:
 
## retrieve current archive status
 
## update the map noting where the object needs to be copied
 
# for each destination:
 
## look up in the map objects that need to be copied there
 
## copy all objects to destination using the copier
 
## set status=present and mtime=now() on all copied objects
 
 
 
Note that:
 
 
 
* In case multiple jobs were tasked to archive the same or overlapping sets of objects, step (2.2) might decide that some/all objects of this batch no longer need to be archived to some/all destinations.
 
* Due to parallelism, it is also possible that the same objects will be copied over at the same time by multiple workers.
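
For illustration, a minimal Python sketch of a worker run, with the same caveat that all helper names are assumptions rather than actual swh-storage APIs:

<pre>
# Hypothetical sketch of an archiver worker run; helper names
# (get_archival_status, copy_to_destination, mark_present) are
# assumptions, not actual swh-storage APIs.
import collections
import datetime


def run_worker(obj_ids, get_archival_status, copy_to_destination, mark_present):
    # 1. map: destination -> objects that need to be copied there
    needed = collections.defaultdict(list)

    # 2. recheck the current archival status of every object in the batch
    for obj_id in obj_ids:
        for copy in get_archival_status(obj_id):
            if copy.status != 'present':
                needed[copy.archive_id].append(obj_id)

    # 3. copy per destination, then record the new status
    for destination, objs in needed.items():
        copy_to_destination(destination, objs)      # delegated to the copier
        now = datetime.datetime.now(datetime.timezone.utc)
        mark_present(destination, objs, mtime=now)  # status=present, mtime=now()
</pre>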
 
 
 
=== Archiver copier ===
 
 
 
The copier is run on demand by archiver workers to transfer a batch of files from the master to a given destination.
 
 
 
The copier transfers all files together over a single network connection. The copying process is atomic at the file granularity (i.e., individual files might be visible on the destination before ''all'' files have been transferred) and ensures that concurrent transfers of the same files by multiple copier instances do not result in corrupted files.  Note that, due to this and the fact that timestamps are updated by the director, all files copied in the same batch will have the same mtime even though the actual file creation times on a given destination might differ.
 
 
 
As a first approximation, the copier can be implemented using rsync, but a dedicated protocol can be devised in the future. In the case of rsync, we should use --files-from to list the files to be copied. We observe that rsync atomically renames files one-by-one during transfer; so as long as --inplace is ''not'' used, concurrent rsyncs of the same files should not be a problem.
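
For illustration, a minimal Python sketch of such an rsync-based copier; the host names, paths, and rsync options other than --files-from are illustrative choices, not actual swh-storage configuration:

<pre>
# Hypothetical sketch of an rsync-based copier; paths, destinations and
# extra rsync options are illustrative, not actual swh-storage code.
import subprocess
import tempfile


def copy_batch(relative_paths, source_root, destination):
    """Copy a batch of object files to `destination` (e.g. 'slave1:/srv/objects/')
    over a single rsync connection."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.lst') as file_list:
        file_list.write('\n'.join(relative_paths) + '\n')
        file_list.flush()
        subprocess.run(
            ['rsync', '--archive', '--compress',
             '--files-from', file_list.name,     # explicit list of files to copy
             # --inplace is deliberately NOT used: rsync writes each file to a
             # temporary name and renames it, so concurrent copies of the same
             # files do not result in corruption
             source_root, destination],
            check=True)
</pre>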
 
 
 
== DB structure ==
 
 
 
PostgreSQL definitions for the archival status:
 
 
 
<pre>CREATE DOMAIN archive_id AS TEXT;

CREATE TABLE archives (
  id   archive_id PRIMARY KEY,
  url  TEXT
);

CREATE TYPE archive_status AS ENUM (
  'missing',
  'ongoing',
  'present'
);

CREATE TABLE content_archive (
  content_id  sha1 REFERENCES content(sha1),
  archive_id  archive_id REFERENCES archives(id),
  status      archive_status,
  mtime       timestamptz,
  PRIMARY KEY (content_id, archive_id)
);</pre>
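
For illustration, a query along these lines (hypothetical, not taken from swh-storage) could be used by the director to find contents with fewer 'present' copies than required by the retention policy:

<pre>
# Hypothetical example of querying this schema with psycopg2 to find
# contents lacking enough 'present' copies; the query and connection
# parameters are illustrative, not actual swh-storage code.
import psycopg2

RETENTION_POLICY = 2  # minimum number of 'present' copies per content


def contents_needing_archival(dsn):
    query = """
        SELECT c.sha1
        FROM content c
        LEFT JOIN content_archive ca
               ON ca.content_id = c.sha1 AND ca.status = 'present'
        GROUP BY c.sha1
        HAVING count(ca.archive_id) < %s
    """
    with psycopg2.connect(dsn) as db:
        with db.cursor() as cur:
            cur.execute(query, (RETENTION_POLICY,))
            return [row[0] for row in cur.fetchall()]
</pre>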
 
 
 
 
 
[[Category:Blueprint]]
 
