Software Heritage requirements

From Software Heritage Wiki

Functional requirements

file content storage

  • raw files, without names/metadata, indexed by intrinsic hash
  • additional hashes or other attributes used to detect/avoid key collisions
  • goal: never lose an added file; at most, hide it from the public (e.g., for a DMCA takedown)
  • remember the "creation time", i.e., the moment a given content was first added to SWH
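
A minimal sketch of these requirements, for illustration only. It assumes SHA-1 as the intrinsic key and SHA-256 plus length as the additional collision-detection attributes; those choices, the in-memory dictionaries, and the names ContentStore, add, hide, get and created_at are all hypothetical, not part of the requirements.

```python
import hashlib
from datetime import datetime, timezone


class ContentStore:
    """In-memory stand-in for an append-only, content-addressed file store."""

    def __init__(self):
        self._blobs = {}  # key (SHA-1 hex, assumed) -> raw bytes
        self._meta = {}   # key -> secondary hash, length, creation time, hidden flag

    def add(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        sha256 = hashlib.sha256(data).hexdigest()
        known = self._meta.get(key)
        if known is not None:
            # Same key seen before: the extra attributes must agree,
            # otherwise two different contents collide on the key.
            if known["sha256"] != sha256 or known["length"] != len(data):
                raise ValueError(f"key collision on {key}")
            return key  # already stored; creation time is left untouched
        self._blobs[key] = data
        self._meta[key] = {
            "sha256": sha256,
            "length": len(data),
            "ctime": datetime.now(timezone.utc),
            "hidden": False,
        }
        return key

    def hide(self, key: str) -> None:
        # Hiding never deletes: the content stays in the store.
        self._meta[key]["hidden"] = True

    def created_at(self, key: str) -> datetime:
        return self._meta[key]["ctime"]

    def get(self, key: str, public: bool = True) -> bytes:
        if public and self._meta[key]["hidden"]:
            raise PermissionError(f"{key} is hidden from the public")
        return self._blobs[key]
```

Note that the sketch has no delete operation at all: the only way to withdraw content is hide(), which matches the "never lose an added file" goal.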

VCS storage

  • goal: be able to restore a working VCS development environment, functionally equivalent to what devs have today
  • will be VCS-specific, no attempt to abstract/generalize over VCS differences
  • VCS storage should be (logically, at least) separate from file content storage
  • Git
    • store all Git objects (commits, trees, tags, blobs, etc.) in loose format
    • note: blobs are isomorphic to file content, so we can skip storing them here (otherwise we would double space usage)
  • Subversion
  • CVS
  • Mercurial
  • ...
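
For reference, a sketch of how a Git loose object is built: the object id is the SHA-1 of a "type size\0" header followed by the payload, and the file on disk is the zlib-compressed header plus payload, stored under a path derived from the id. The function names git_object_id and loose_object are hypothetical helpers chosen for this example.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the Git object id: SHA-1 over 'type size\\0' + payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object(obj_type: str, payload: bytes):
    """Return (relative path, compressed bytes) of the loose object file."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    raw = header + payload
    oid = hashlib.sha1(raw).hexdigest()
    # Loose objects are fanned out by the first two hex digits of the id.
    path = f"objects/{oid[:2]}/{oid[2:]}"
    return path, zlib.compress(raw)
```

Because a blob id is computed over the header as well as the content, it differs from the plain SHA-1 of the file. If blobs are skipped in VCS storage, file content storage would presumably need to record the Git blob id as one of its additional hashes, so that trees can still be resolved to their files.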

data sources

  • tree (actually a DAG) structure to organize data sources
  • leaves are sources that we can crawl to retrieve:
    • either a tarball (single version of a set of files)
    • or a VCS snapshot (several versions of a set of files)
  • internal nodes are sources that we can list to retrieve:
    • a list of children for further processing and addition to the tree
  • example: debian → releases (e.g., jessie, wheezy, etc.) → source packages (tarball leaves)
  • example: github → repositories (VCS leaves)
  • the data structure needs to be versioned (e.g., what will http://debian.org point to in 100 years?)
  • update management: data sources will drive the update process
    • they can be manually added to bootstrap a specific crawling sub-space
    • they will be associated with refresh periods
      • some internal nodes will have "list all" policies (e.g., debian), others "list since last time" policies (e.g., github)
      • the same applies to leaves (e.g., a tarball is either re-downloaded or not, whereas a VCS is fetched/updated since last time)
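
The structure above can be sketched as follows. This is only an illustration under stated assumptions: the DataSource fields, the policy strings "all" and "since_last", the use of integer day counts for time, and the functions is_due and due_sources are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSource:
    url: str
    kind: str                  # "lister" (internal node), "tarball" or "vcs" (leaves)
    refresh_days: int          # refresh period
    policy: str = "all"        # "all" or "since_last"
    children: List["DataSource"] = field(default_factory=list)
    last_visit: Optional[int] = None  # day number of the last visit, None if never


def is_due(source: DataSource, today: int) -> bool:
    """A source is due if never visited or if its refresh period has elapsed."""
    return source.last_visit is None or today - source.last_visit >= source.refresh_days


def due_sources(root: DataSource, today: int) -> List[DataSource]:
    """Depth-first list of sources due for a refresh; shared DAG nodes are visited once."""
    seen, due, stack = set(), [], [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if is_due(node, today):
            due.append(node)
        stack.extend(reversed(node.children))
    return due


# The two examples from the list above, under a hypothetical root node.
package = DataSource("debian/jessie/some-package", "tarball", refresh_days=30)
jessie = DataSource("debian/jessie", "lister", refresh_days=7, children=[package])
debian = DataSource("debian", "lister", refresh_days=7, policy="all", children=[jessie])
repo = DataSource("github/some/repository", "vcs", refresh_days=1, policy="since_last")
github = DataSource("github", "lister", refresh_days=1, policy="since_last", children=[repo])
root = DataSource("swh-root", "lister", refresh_days=1, children=[debian, github])
```

A scheduler would then call due_sources(root, today), list or crawl each returned source according to its kind and policy, and update last_visit. Versioning of the structure itself is not covered by this sketch.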

occurrences

  • link stored content (both files and VCS objects) with its origin (data source) in the wild
    • both "creation occurrence", for first injection
    • and "seen occurrences" (plural), for every other spotting after first injection
  • should support the addition of arbitrary context information that depends on the source
    • e.g., file/pathname (all sources), current branch (VCS)
    • noteworthy special case: timestamps
      • we should distinguish between SWH timestamp
      • and timestamps that can be extracted from the data source (e.g., commit time, release time, upload time, etc.)
  • seen occurrences might be updated lazily
    • e.g., updating the seen time of Git objects out of band, after having observed that the root commit hasn't changed since the last time
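
One possible shape for occurrence records, as a sketch. The Occurrence fields, the OccurrenceLog class and its record method are hypothetical names; timestamps are plain integers here to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Occurrence:
    object_id: str               # key of the stored file or VCS object
    origin: str                  # data source where the object was found
    swh_time: int                # SWH timestamp: when SWH observed the object
    source_time: Optional[int]   # timestamp extracted from the source, if any
    context: Dict[str, str] = field(default_factory=dict)  # e.g., path, branch


class OccurrenceLog:
    def __init__(self):
        self.creation: Dict[str, Occurrence] = {}   # first injection, one per object
        self.seen: Dict[str, List[Occurrence]] = {}  # every later spotting

    def record(self, occ: Occurrence) -> str:
        """Store occ as the creation occurrence if the object is new, else as a seen one."""
        if occ.object_id not in self.creation:
            self.creation[occ.object_id] = occ
            return "creation"
        self.seen.setdefault(occ.object_id, []).append(occ)
        return "seen"
```

Keeping swh_time and source_time as separate fields preserves the distinction between the two kinds of timestamps. Lazy updates would call record for a whole batch of objects out of band, once the root commit is known to be unchanged.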

"project" metadata

  • data sources can be associated with project metadata, e.g., DOAP or other ad-hoc ontologies
  • metadata can evolve over time and should not be lost


Non-functional requirements

parallel processing

  • data injection will use remote workers that interact with central storage via a fine-grained (REST) API

data integrity

  • protection against: human error, catastrophe, natural decay (e.g., bit flips)
  • file content and VCS storage
    • append-only backup (since day 1), at least 2 copies
    • mirrored (later on, possibly also partially)
    • periodic fsck against storage key + restore from copies upon checksum mismatches
  • everything else (~= Postgres DB)
    • backup
    • replicated (?)
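
The periodic fsck for content storage can be sketched as below, assuming objects are keyed by the SHA-1 of their bytes and that each copy is exposed as a dictionary. Both are assumptions made for the example, as is the function name fsck.

```python
import hashlib


def fsck(primary: dict, backups: list) -> list:
    """Check every object against its key and repair it from the first intact copy.

    Returns a list of (key, status) pairs for objects that failed the check,
    where status is "restored" or "unrecoverable".
    """
    report = []
    for key in sorted(primary):
        if hashlib.sha1(primary[key]).hexdigest() == key:
            continue  # checksum matches the storage key, nothing to do
        for copy in backups:
            candidate = copy.get(key)
            # Only trust a backup copy that itself passes the checksum.
            if candidate is not None and hashlib.sha1(candidate).hexdigest() == key:
                primary[key] = candidate
                report.append((key, "restored"))
                break
        else:
            report.append((key, "unrecoverable"))
    return report
```

Verifying the backup copy before restoring matters: with at least 2 copies, a decayed backup should be skipped in favour of an intact one rather than propagated.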

audit / logging

  • SWH maintenance/update tasks will be coordinated using distributed workers and a centralized job queue
  • all data manipulation actions will be logged persistently, and will potentially be reproducible from those logs (assuming the source is still available)