Software Heritage requirements
Functional requirements
file content storage
- raw files, without names/metadata, indexed by intrinsic hash
- additional hashes or other attributes used to detect/avoid key collisions
- goal: never lose an added file; at most hide it from the public (e.g., for a DMCA takedown)
- remember "creation time", the moment some content was first added to SWH (a sketch of this indexing scheme follows below)
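
A minimal Python sketch of such content-addressed indexing. The choice of SHA-1 as primary key, the extra attributes, and the in-memory dict store are assumptions for illustration, not a settled design:

```python
import hashlib
import time

def content_key(raw: bytes) -> dict:
    """Primary storage key plus extra attributes used to detect/avoid
    collisions on the primary key (hypothetical choice of hashes)."""
    return {
        "sha1": hashlib.sha1(raw).hexdigest(),      # primary key
        "sha256": hashlib.sha256(raw).hexdigest(),  # collision check
        "length": len(raw),                         # cheap extra attribute
    }

def add_content(store: dict, raw: bytes) -> dict:
    key = content_key(raw)
    existing = store.get(key["sha1"])
    if existing is None:
        store[key["sha1"]] = {
            **key,
            "data": raw,
            "ctime": time.time(),  # SWH "creation time": first addition to the archive
            "visible": True,       # never delete; at most flip this off (e.g., DMCA)
        }
    elif existing["sha256"] != key["sha256"]:
        raise RuntimeError(f"sha1 collision on {key['sha1']}")
    return key
```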
VCS storage
- goal: be able to restore a working VCS development environment, functionally equivalent to what devs have today
- will be VCS-specific, no attempt to abstract/generalize over VCS differences
- VCS storage should be (logically, at least) separate from file content storage
- Git
- store all Git objects (commits, trees, tags, blobs, etc.) in loose format
- note: blobs are isomorphic to file content, so we can ignore them here (otherwise we would double space usage); see the sketch after this list
- Subversion
- CVS
- Mercurial
- ...
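
To make the blob/file-content isomorphism concrete, this sketch shows how Git derives a loose object: the id is the SHA-1 of a "<type> <size>\0" header followed by the payload, and the loose file on disk is the zlib compression of those same bytes. A blob id is therefore a pure function of the file content, which is why storing blobs would duplicate the file content storage:

```python
import hashlib
import zlib

def git_object_id(obj_type: str, payload: bytes) -> str:
    """Id of a Git object: SHA-1 over header + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()

def loose_object(obj_type: str, payload: bytes) -> bytes:
    """On-disk loose representation: zlib-compressed header + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return zlib.compress(header + payload)

# the blob id of "hello\n", as computed by `git hash-object`
assert git_object_id("blob", b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```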
data sources
- tree (actually: DAG) structure to organize data sources
- leaves are sources that we can crawl to retrieve:
- either a tarball (single version of a set of files)
- or a VCS snapshot (several versions of a set of files)
- internal nodes are sources that we can list to retrieve:
- a list of children, for further processing and addition to the tree
- example: debian → releases (e.g., jessie, wheezy, etc.) → source packages (tarball leaves)
- example: github → repositories (VCS leaves)
- the data structure needs to be versioned (e.g., what will http://debian.org point to in 100 years?)
- update management: data sources will drive the update process
- they can be manually added to bootstrap a specific crawling sub-space
- they will be associated with refresh periods
- some internal nodes will have "list all" policies (e.g., debian), others "list since last time" policies (e.g., github)
- the same applies to leaves (e.g., a tarball may or may not need to be re-downloaded, whereas a VCS snapshot is fetched/updated incrementally since the last visit); a sketch of this data model follows below
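
A possible shape for the data-source DAG, sketched in Python. The field names, policy vocabulary, and example URLs are assumptions for illustration (versioning of the structure itself is omitted here):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataSource:
    url: str
    kind: str                           # "lister" (internal node), "tarball" or "vcs" (leaf)
    list_policy: Optional[str] = None   # "full" (e.g., debian) or "incremental" (e.g., github)
    refresh_period_days: int = 30       # drives the update scheduler
    children: List["DataSource"] = field(default_factory=list)  # nodes may share children: a DAG, not a tree

# example: debian -> releases -> source packages (tarball leaves)
debian = DataSource("http://debian.org", "lister", list_policy="full")
jessie = DataSource("http://deb.debian.org/debian/dists/jessie", "lister", list_policy="full")
debian.children.append(jessie)

# example: github -> repositories (VCS leaves)
github = DataSource("https://github.com", "lister", list_policy="incremental")
github.children.append(DataSource("https://github.com/example/repo", "vcs"))
```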
occurrences
- link stored content (both files and VCS objects) with its origin (data source) in the wild
- both "creation occurrence", for first injection
- and "seen occurrences" (plural), for every other spotting after first injection
- should support addition of arbitrary context information, that depend on the source
- e.g., file/pathname (all sources), current branch (VCS)
- noteworthy special case: timestamps
- we should distinguish between the SWH timestamp
- and timestamps that can be extracted from the data source (e.g., commit time, release time, upload time, etc.)
- seen occurrences might be updated lazily
- e.g., updating the seen time of Git objects out of band, after having observed that the root commit hasn't changed since the last time (a sketch of an occurrence record follows)
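
An illustrative occurrence record, keeping the SWH-side timestamp (assigned by us, trusted) separate from source-claimed timestamps; the field names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Occurrence:
    content_key: str                         # intrinsic hash of the file or VCS object
    origin_url: str                          # where it was spotted in the wild
    kind: str                                # "creation" (first injection) or "seen"
    swh_time: datetime                       # when *we* observed it (trusted)
    source_time: Optional[datetime] = None   # e.g., commit/release/upload time (claimed by the source)
    context: Optional[dict] = None           # source-dependent, e.g., {"path": ..., "branch": ...}

occ = Occurrence(
    content_key="ce013625030ba8dba906f756967f9e9ca394464a",
    origin_url="https://github.com/example/repo",
    kind="seen",
    swh_time=datetime.now(timezone.utc),
    context={"path": "README", "branch": "master"},
)
```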
"project" metadata
- data sources can be associated with project metadata, e.g., DOAP or other ad-hoc ontologies
- metadata can evolve over time and should not be lost (see the append-only sketch below)
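
One way to satisfy "evolves but is never lost" is an append-only revision history per data source; in this minimal sketch the record shape is an assumption and metadata documents are kept verbatim:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass(frozen=True)
class MetadataRevision:
    source_url: str
    fetched_at: datetime
    format: str        # e.g., "DOAP" or an ad-hoc ontology name
    document: str      # raw metadata payload, kept verbatim

# per-source history: revisions are only ever appended, never rewritten
history: List[MetadataRevision] = []
```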
Non-functional requirements
parallel processing
- data injection will use remote workers that interact with central storage via a fine-grained (REST) API, as in the sketch below
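
A hypothetical example of such a fine-grained call from a remote worker; the endpoint, URL scheme, and payload shape are invented for illustration and are not a real SWH API:

```python
import requests

def inject_content(api_base: str, sha1: str, raw: bytes) -> None:
    """Push one file content to central storage, keyed by intrinsic hash."""
    resp = requests.post(
        f"{api_base}/content/{sha1}",          # hypothetical endpoint
        data=raw,
        headers={"Content-Type": "application/octet-stream"},
    )
    resp.raise_for_status()                    # workers retry/report on failure

inject_content("https://archive.example.org/api/1", "ce0136...", b"hello\n")
```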
data integrity
- protection against: human error, catastrophe, natural decay (e.g., bit flips)
- file content and VCS storage
- append-only backup (since day 1), at least 2 copies
- mirrored (later on, possibly also partially)
- periodic fsck against the storage key + restore from copies upon checksum mismatch (see the sketch after this list)
- everything else (~= Postgres DB)
- backup
- replicated (?)
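
Because the storage key is the intrinsic hash, the periodic fsck reduces to recomputing hashes; in this sketch the on-disk layout (files named after their SHA-1 key) and the restore path are assumptions:

```python
import hashlib
from pathlib import Path

def fsck(storage_root: Path, mirror_root: Path) -> None:
    """Detect natural decay (e.g., bit flips) and restore from a copy."""
    for obj in storage_root.rglob("*"):
        if not obj.is_file():
            continue
        expected = obj.name                                # assumed layout: filename == sha1 key
        actual = hashlib.sha1(obj.read_bytes()).hexdigest()
        if actual != expected:
            copy = mirror_root / obj.relative_to(storage_root)
            obj.write_bytes(copy.read_bytes())             # restore from the mirror copy
```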
audit / logging
- SWH maintenance/update tasks will be coordinated using distributed workers and a centralized job queue
- logs of all data manipulation actions will be persistent and potentially replayable (assuming the sources are still available)
- e.g., using Apache Kafka, as in the sketch below
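
A sketch of how a worker could publish its data-manipulation log to Kafka (using the kafka-python client); the broker address, topic name, and message schema are assumptions:

```python
import json
import time
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode(),
)

def log_action(action: str, **details) -> None:
    """Persist one data-manipulation event; replaying the log against
    still-available sources would reproduce the archive's state."""
    producer.send("swh-audit", value={            # assumed topic name
        "action": action,                         # e.g., "inject-content", "fetch-vcs"
        "time": time.time(),
        **details,
    })

log_action("inject-content", sha1="ce0136...", origin="https://github.com/example/repo")
producer.flush()
```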