Software Heritage requirements
Functional requirements
file content storage
- raw files, without names/metadata, indexed by intrinsic hash
- additional hashes or other attributes used to detect/avoid key collisions
- goal: never lose an added file; at most hide it from the public (e.g., for a DMCA takedown)
- remember "creation time", the moment some content was first added to SWH (a sketch of this indexing scheme follows below)
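
A minimal Python sketch of such content-addressed indexing. The choice of SHA-1 as primary key, the extra attributes, and the in-memory dict store are assumptions for illustration, not a settled design:

```python
import hashlib
import time

def content_key(raw: bytes) -> dict:
    """Primary storage key plus extra attributes used to detect/avoid
    collisions on the primary key (hypothetical choice of hashes)."""
    return {
        "sha1": hashlib.sha1(raw).hexdigest(),      # primary key
        "sha256": hashlib.sha256(raw).hexdigest(),  # collision check
        "length": len(raw),                         # cheap extra attribute
    }

def add_content(store: dict, raw: bytes) -> dict:
    key = content_key(raw)
    existing = store.get(key["sha1"])
    if existing is None:
        store[key["sha1"]] = {
            **key,
            "data": raw,
            "ctime": time.time(),  # SWH "creation time": first addition to the archive
            "visible": True,       # never delete; at most flip this off (e.g., DMCA)
        }
    elif existing["sha256"] != key["sha256"]:
        raise RuntimeError(f"sha1 collision on {key['sha1']}")
    return key
```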
VCS storage
- goal: be able to restore a working VCS development environment, functionally equivalent to what devs have today
- will be VCS-specific, no attempt to abstract/generalize over VCS differences
- VCS storage should be (logically, at least) separate from file content storage
- Git
- store all Git objects (commits, trees, tags, blobs, etc.) in loose format
- note: blobs are isomorphic to file content, so we can ignore them here (otherwise we would double space usage); see the sketch after this list
- Subversion
- CVS
- Mercurial
- ...
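
To make the blob/file-content isomorphism concrete, this sketch shows how Git derives a loose object: the id is the SHA-1 of a "<type> <size>\0" header followed by the payload, and the loose file on disk is the zlib compression of those same bytes. A blob id is therefore a pure function of the file content, which is why storing blobs would duplicate the file content storage:

```python
import hashlib
import zlib

def git_object_id(obj_type: str, payload: bytes) -> str:
    """Id of a Git object: SHA-1 over header + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()

def loose_object(obj_type: str, payload: bytes) -> bytes:
    """On-disk loose representation: zlib-compressed header + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return zlib.compress(header + payload)

# the blob id of "hello\n", as computed by `git hash-object`
assert git_object_id("blob", b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```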
data sources
- tree (actually: DAG) structure to organize data sources
- leaves are sources that we can crawl to retrieve:
- either a tarball (single version of a set of files)
- or a VCS snapshot (several versions of a set of files)
- internal nodes are sources that we can list to retrieve:
- a list of children, for further processing and addition to the tree
- example: debian → releases (e.g., jessie, wheezy, etc.) → source packages (tarball leaves)
- example: github → repositories (VCS leaves)
- the data structure needs to be versioned (e.g., what will http://debian.org point to in 100 years?)
- update management: data sources will drive the update process
- they can be manually added to bootstrap a specific crawling sub-space
- they will be associated with refresh periods
- some internal nodes will have "list all" policies (e.g., debian), others "list since last time" policies (e.g., github)
- the same applies to leaves (e.g., a tarball may or may not need to be re-downloaded, whereas a VCS snapshot is fetched/updated incrementally since the last visit); a sketch of this data model follows below
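
A possible shape for the data-source DAG, sketched in Python. The field names, policy vocabulary, and example URLs are assumptions for illustration (versioning of the structure itself is omitted here):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataSource:
    url: str
    kind: str                           # "lister" (internal node), "tarball" or "vcs" (leaf)
    list_policy: Optional[str] = None   # "full" (e.g., debian) or "incremental" (e.g., github)
    refresh_period_days: int = 30       # drives the update scheduler
    children: List["DataSource"] = field(default_factory=list)  # nodes may share children: a DAG, not a tree

# example: debian -> releases -> source packages (tarball leaves)
debian = DataSource("http://debian.org", "lister", list_policy="full")
jessie = DataSource("http://deb.debian.org/debian/dists/jessie", "lister", list_policy="full")
debian.children.append(jessie)

# example: github -> repositories (VCS leaves)
github = DataSource("https://github.com", "lister", list_policy="incremental")
github.children.append(DataSource("https://github.com/example/repo", "vcs"))
```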
occurrences
- link stored content (both files and VCS objects) with its origin (data source) in the wild
- both "creation occurrence", for first injection
- and "seen occurrences" (plural), for every other spotting after first injection
- should support addition of arbitrary context information, that depend on the source
- e.g., file/pathname (all sources), current branch (VCS)
- noteworthy special case: timestamps
- we should distinguish between the SWH timestamp
- and timestamps that can be extracted from the data source (e.g., commit time, release time, upload time, etc.)
- seen occurrences might be updated lazily
- e.g., updating the seen time of Git objects out of band, after having observed that the root commit hasn't changed since the last time (a sketch of an occurrence record follows)
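
An illustrative occurrence record, keeping the SWH-side timestamp (assigned by us, trusted) separate from source-claimed timestamps; the field names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Occurrence:
    content_key: str                         # intrinsic hash of the file or VCS object
    origin_url: str                          # where it was spotted in the wild
    kind: str                                # "creation" (first injection) or "seen"
    swh_time: datetime                       # when *we* observed it (trusted)
    source_time: Optional[datetime] = None   # e.g., commit/release/upload time (claimed by the source)
    context: Optional[dict] = None           # source-dependent, e.g., {"path": ..., "branch": ...}

occ = Occurrence(
    content_key="ce013625030ba8dba906f756967f9e9ca394464a",
    origin_url="https://github.com/example/repo",
    kind="seen",
    swh_time=datetime.now(timezone.utc),
    context={"path": "README", "branch": "master"},
)
```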
"project" metadata
- data sources can be associated with project metadata, e.g., DOAP or other ad-hoc ontologies
- metadata can evolve over time and should not be lost (see the append-only sketch below)
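
One way to satisfy "evolves but is never lost" is an append-only revision history per data source; in this minimal sketch the record shape is an assumption and metadata documents are kept verbatim:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass(frozen=True)
class MetadataRevision:
    source_url: str
    fetched_at: datetime
    format: str        # e.g., "DOAP" or an ad-hoc ontology name
    document: str      # raw metadata payload, kept verbatim

# per-source history: revisions are only ever appended, never rewritten
history: List[MetadataRevision] = []
```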
Non-functional requirements
parallel processing
- data injection will use remote workers that interact with central storage via a fine-grained (REST) API, as in the sketch below
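
A hypothetical example of such a fine-grained call from a remote worker; the endpoint, URL scheme, and payload shape are invented for illustration and are not a real SWH API:

```python
import requests

def inject_content(api_base: str, sha1: str, raw: bytes) -> None:
    """Push one file content to central storage, keyed by intrinsic hash."""
    resp = requests.post(
        f"{api_base}/content/{sha1}",          # hypothetical endpoint
        data=raw,
        headers={"Content-Type": "application/octet-stream"},
    )
    resp.raise_for_status()                    # workers retry/report on failure

inject_content("https://archive.example.org/api/1", "ce0136...", b"hello\n")
```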
data integrity
- protection against: human error, catastrophe, natural decay (e.g., bit flips)
- file content and VCS storage
- append-only backup (since day 1), at least 2 copies
- mirrored (later on, possibly also partially)
- periodic fsck against the storage key + restore from copies upon checksum mismatch (see the sketch after this list)
- everything else (~= Postgres DB)
- backup
- replicated (?)
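
Because the storage key is the intrinsic hash, the periodic fsck reduces to recomputing hashes; in this sketch the on-disk layout (files named after their SHA-1 key) and the restore path are assumptions:

```python
import hashlib
from pathlib import Path

def fsck(storage_root: Path, mirror_root: Path) -> None:
    """Detect natural decay (e.g., bit flips) and restore from a copy."""
    for obj in storage_root.rglob("*"):
        if not obj.is_file():
            continue
        expected = obj.name                                # assumed layout: filename == sha1 key
        actual = hashlib.sha1(obj.read_bytes()).hexdigest()
        if actual != expected:
            copy = mirror_root / obj.relative_to(storage_root)
            obj.write_bytes(copy.read_bytes())             # restore from the mirror copy
```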
audit / logging
- SWH maintenance/update tasks will be coordinated using distributed workers and a centralized job queue
- logs of all data manipulation actions will be persistent and potentially replayable (assuming the sources are still available)
- e.g., using Apache Kafka, as in the sketch below
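
A sketch of how a worker could publish its data-manipulation log to Kafka (using the kafka-python client); the broker address, topic name, and message schema are assumptions:

```python
import json
import time
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode(),
)

def log_action(action: str, **details) -> None:
    """Persist one data-manipulation event; replaying the log against
    still-available sources would reproduce the archive's state."""
    producer.send("swh-audit", value={            # assumed topic name
        "action": action,                         # e.g., "inject-content", "fetch-vcs"
        "time": time.time(),
        **details,
    })

log_action("inject-content", sha1="ce0136...", origin="https://github.com/example/repo")
producer.flush()
```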