Difference between revisions of "Repository snapshot objects"

From Software Heritage Wiki
m (Manifest)
m (Manifest: prefer numbered lists over unordered lists where appropriate (markup))
Revision as of 07:23, 16 August 2016

WARNING: work in progress blueprint

Introduction

A repository snapshot object, or simply snapshot object, is a Merkle DAG node used to capture the current state of a VCS repository.

Conceptually, a snapshot object is a complete map from repository entry points ("branches" in Software Heritage terminology, "refs" in Git) to other objects in the repository, including other snapshot objects if repository entry points point to them.
Practically, the map is serialized into a manifest consisting of a list of triples <object type, object ID, branch name>.

Entries in snapshots can point to the following object kinds:

  • contents (Git terminology: blobs)
  • directories (trees)
  • releases (annotated tags)
  • revisions (commits)
  • snapshots

The object ID of a snapshot object is the cryptographic hash of its manifest, computed in the same way as for the other nodes of the Merkle DAG.

Manifest

The manifest of a snapshot object is a canonical representation of it as a sequence of bytes.
Two alternative serialization formats for such manifests are proposed below:

  • a-la Software Heritage: how we would implement it on our own, without taking into account compatibility with, or the stylistic choices of, other VCSs
  • a-la Git: manifest implementation similar to how Git implements manifests for other DAG objects

No matter the implementation of the manifest itself, its object ID is obtained as follows:

  • take the string obtained by concatenating:
    1. the header string "snapshot " (without quotes)
    2. the length of the manifest serialized as a decimal integer in ASCII (e.g., "42")
    3. the NULL byte "\0"
    4. the manifest itself
  • compute the SHA1 checksum of the obtained string

This is equivalent to what the current implementation of git-hash-object(1) computes with object type set to "snapshot". Note that to use it this way you will need to pass --literally, as "snapshot" is currently not a supported Git object type.
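
The object ID computation described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names git_like_object_id and snapshot_object_id are hypothetical, and the manifest is assumed to be already serialized in one of the formats proposed below.

```python
import hashlib


def git_like_object_id(object_type: str, payload: bytes) -> str:
    """Hash a payload the way the steps above describe.

    Hypothetical helper: the header is the object type, an ASCII space,
    the payload length as a decimal ASCII integer, and a NULL byte.
    """
    header = object_type.encode("ascii") + b" "
    header += str(len(payload)).encode("ascii") + b"\0"
    # SHA1 over the header immediately followed by the payload.
    return hashlib.sha1(header + payload).hexdigest()


def snapshot_object_id(manifest: bytes) -> str:
    """Return the object ID (lowercase hex SHA1) of a serialized manifest."""
    return git_like_object_id("snapshot", manifest)


if __name__ == "__main__":
    # Sanity check of the header scheme against a well-known Git value:
    # the ID of the empty blob.
    print(git_like_object_id("blob", b""))
    # Object ID of an empty snapshot manifest.
    print(snapshot_object_id(b""))
```

Parameterizing the helper on the object type makes it possible to check the header scheme against IDs that Git itself reports for supported types such as "blob", before relying on it for "snapshot".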

Note: branch/ref names might contain arbitrary characters except the NULL byte itself.

Serialization a-la Software Heritage

A snapshot object manifest is a list of entries, one per branch and sorted by branch name, where each entry consists of the following fields, in order:

  1. object kind (one of: "content", "directory", "release", "revision", "snapshot")
  2. ASCII space
  3. SHA1 of the target object serialized as a string of ASCII, lowercase hex digits (e.g., "585f6e27f540012af621a18d0155aae2a8ec0276")
  4. ASCII space
  5. entry name as a sequence of bytes (e.g., "refs/heads/master")
  6. NULL byte "\0"
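
As an illustration, the entry layout above could be produced by a sketch like the following. The function name serialize_swh and the input shape (a dict from branch name to a pair of kind and hex SHA1) are hypothetical. The sort order is also an assumption: the text only says entries are sorted by branch, so this sketch compares branch names as raw bytes.

```python
from typing import Dict, Tuple

SWH_KINDS = {"content", "directory", "release", "revision", "snapshot"}
HEX_DIGITS = set("0123456789abcdef")


def serialize_swh(branches: Dict[bytes, Tuple[str, str]]) -> bytes:
    """Serialize {branch name: (kind, hex SHA1)} into a manifest."""
    out = bytearray()
    # Assumption: byte-wise ordering of branch names.
    for name in sorted(branches):
        kind, hex_id = branches[name]
        if kind not in SWH_KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        if len(hex_id) != 40 or not set(hex_id) <= HEX_DIGITS:
            raise ValueError(f"not a lowercase hex SHA1: {hex_id!r}")
        if b"\0" in name:
            raise ValueError("branch names cannot contain the NULL byte")
        # kind, space, hex SHA1, space, branch name, NULL byte
        out += kind.encode("ascii") + b" "
        out += hex_id.encode("ascii") + b" "
        out += name + b"\0"
    return bytes(out)


if __name__ == "__main__":
    manifest = serialize_swh({
        b"refs/heads/master":
            ("revision", "585f6e27f540012af621a18d0155aae2a8ec0276"),
    })
    print(manifest)
```

Because the NULL byte cannot appear in a branch name and comes last, it doubles as an unambiguous entry terminator, even when the branch name itself contains spaces.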

Serialization a-la Git

A snapshot object manifest is a list of entries, one per ref and sorted by ref name, where each entry consists of the following fields, in order:

  1. object kind (one of: "blob", "tree", "tag", "commit", "snapshot")
  2. ASCII space
  3. entry name as a sequence of bytes (e.g., "refs/heads/master")
  4. NULL byte "\0"
  5. SHA1 of the target object serialized as 20 raw bytes

Notes:

  • there is no separator between entries
  • the above is inspired by the compact serialization format of tree objects, but other variants are possible:
    • reorder the columns so that entry name comes last and "\0" acts as entry terminator
    • serialize SHA1 as ASCII instead of binary (it will take more space, but arguably snapshot objects will be less numerous than tree objects)
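
The Git-like layout, and the fact that it can be parsed back without any separator between entries, can be sketched as follows. The names serialize_git and parse_git and the input shape are hypothetical, and byte-wise sorting of ref names is again an assumption that the text does not spell out.

```python
from typing import Dict, Iterator, Tuple

GIT_KINDS = {"blob", "tree", "tag", "commit", "snapshot"}
SHA1_LEN = 20  # raw SHA1 size in bytes


def serialize_git(refs: Dict[bytes, Tuple[str, bytes]]) -> bytes:
    """Serialize {ref name: (kind, raw 20-byte SHA1)} into a manifest."""
    out = bytearray()
    for name in sorted(refs):  # assumption: byte-wise ordering
        kind, raw_id = refs[name]
        if kind not in GIT_KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        if len(raw_id) != SHA1_LEN:
            raise ValueError("SHA1 must be exactly 20 raw bytes")
        if b"\0" in name:
            raise ValueError("ref names cannot contain the NULL byte")
        # kind, space, ref name, NULL byte, raw SHA1; no entry separator
        out += kind.encode("ascii") + b" " + name + b"\0" + raw_id
    return bytes(out)


def parse_git(manifest: bytes) -> Iterator[Tuple[str, bytes, bytes]]:
    """Yield (kind, ref name, raw SHA1) for each entry of a manifest."""
    pos = 0
    while pos < len(manifest):
        # Kinds contain no space, so the first space ends the kind.
        space = manifest.index(b" ", pos)
        # Names contain no NULL byte, so the next one ends the name.
        nul = manifest.index(b"\0", space + 1)
        raw_id = manifest[nul + 1:nul + 1 + SHA1_LEN]
        if len(raw_id) != SHA1_LEN:
            raise ValueError("truncated entry")
        yield manifest[pos:space].decode("ascii"), manifest[space + 1:nul], raw_id
        # The fixed-size SHA1 tells us where the next entry starts.
        pos = nul + 1 + SHA1_LEN


if __name__ == "__main__":
    raw = bytes.fromhex("585f6e27f540012af621a18d0155aae2a8ec0276")
    data = serialize_git({b"refs/heads/master": ("commit", raw)})
    print(list(parse_git(data)))
```

Parsing relies on the raw SHA1 having a fixed size: the 20 bytes may themselves contain spaces or NULL bytes, so they are consumed by length rather than scanned for a delimiter.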