Difference between revisions of "Repository snapshot objects"

From Software Heritage Wiki
Latest revision as of 14:38, 15 September 2016

  • WARNING: work in progress blueprint
  • INFO: the Software Heritage implementation of repository snapshot objects is tracked at T565 (https://forge.softwareheritage.org/T565)

Introduction

A repository snapshot object, or simply snapshot object, is a Merkle DAG node used to capture the current state of a VCS repository.

Conceptually, a snapshot object is a complete map from repository entry points ("branches" in Software Heritage terminology, "refs" in Git) to other objects in the repository, including other snapshot objects if repository entry points point to them.
Practically, the map is serialized into a manifest consisting of a list of triples <object type, object ID, branch name> (specifying the object type would not be strictly needed, but it eases traversal, allows for better integrity checking, etc.).

Entries in snapshots can point to the following object kinds:

  • contents (Git terminology: blobs)
  • directories (tree)
  • releases (annotated tags)
  • revisions (commits)
  • snapshots

The object ID of a snapshot object is the cryptographic hash of its manifest, computed in the usual way for the Merkle DAG.

Manifest

The manifest of a snapshot object is a canonical representation of it as a sequence of bytes.
Two alternative serialization formats for such manifests are proposed below:

  • a-la Software Heritage: how we would implement it on our own, without regard to compatibility with, or the stylistic choices of, other VCSs
  • a-la Git: a manifest format similar to the one Git uses for its other DAG objects

No matter how the manifest itself is serialized, the object ID of the snapshot object is obtained as follows:

  • take the string obtained by concatenating:
      1. the header string "snapshot " (without quotes)
      2. the length of the manifest serialized as a decimal integer in ASCII (e.g., "42")
      3. the NULL byte "\0"
      4. the manifest itself
  • compute the SHA1 checksum of the obtained string

This is equivalent to what the current implementation of git-hash-object(1) (https://git-scm.com/docs/git-hash-object) computes when the object type is set to "snapshot". Note that to use it you will need to pass --literally, as "snapshot" is currently not a supported Git object type.

Note: branch/ref names may contain arbitrary characters except the NULL byte, which the serialization formats below use as a delimiter.

Implementation

Two possible implementations of snapshot objects are detailed below, one more in line with Software Heritage conventions, the other more aligned with Git ones.

a-la Software Heritage

Snapshot objects will contain one entry for each branch that would be listed in the occurrence table while visiting an origin. (In this context "branches" roughly correspond to Git refs.) Each entry will point to a fully resolved object ID (i.e., a SHA). The equivalents of Git symbolic refs are not stored in their non-resolved form (i.e., as a ref name).
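
As a rough illustration of "fully resolved", the sketch below follows chains of symbolic refs until it reaches an object ID. It reflects one possible reading of the paragraph above; the function name resolve_refs and the two-dictionary input shape are assumptions made for the example, not an existing interface.

```python
from typing import Dict


def resolve_refs(direct: Dict[str, str], symbolic: Dict[str, str]) -> Dict[str, str]:
    """Map every ref name to an object ID, resolving symbolic refs.

    direct:   ref name -> hex object ID
    symbolic: ref name -> name of the ref it points to
    Both input shapes are hypothetical, chosen for illustration only.
    """
    resolved = dict(direct)
    for name, target in symbolic.items():
        seen = {name}
        # Follow symbolic -> symbolic chains, refusing to loop forever.
        while target in symbolic:
            if target in seen:
                raise ValueError(f"symbolic ref cycle involving {target!r}")
            seen.add(target)
            target = symbolic[target]
        if target not in direct:
            raise ValueError(f"{name!r} points to unknown ref {target!r}")
        resolved[name] = direct[target]
    return resolved
```

For instance, with "HEAD" pointing symbolically at "refs/heads/master", the result holds the same object ID under both names and no ref name is stored as a target.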

Manifest serialization

A snapshot object manifest is a list of entries, named and sorted by branch, where each entry is as follows:

  1. object kind (one of: "content", "directory", "release", "revision", "snapshot")
  2. ASCII space
  3. SHA1 of the target object serialized as a string of ASCII, lowercase hex digits (e.g., "585f6e27f540012af621a18d0155aae2a8ec0276")
  4. ASCII space
  5. entry name as a sequence of bytes (e.g., "refs/heads/master")
  6. NULL byte "\0"
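
A minimal sketch of this entry layout follows. The function name serialize_swh_manifest and the (branch name, object kind, hex SHA1) tuple shape are hypothetical, and sorting by the raw bytes of the branch name is an assumption, since the text only says "sorted by branch".

```python
from typing import Iterable, Tuple

# Object kinds listed in the entry description above.
KINDS = ("content", "directory", "release", "revision", "snapshot")


def serialize_swh_manifest(entries: Iterable[Tuple[bytes, str, str]]) -> bytes:
    """Serialize (branch name, object kind, hex SHA1) triples into a manifest."""
    chunks = []
    # Assumption: "sorted by branch" means bytewise order of the raw names.
    for name, kind, sha1_hex in sorted(entries, key=lambda e: e[0]):
        if kind not in KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        if b"\0" in name:
            raise ValueError("branch names cannot contain the NULL byte")
        if len(sha1_hex) != 40 or len(bytes.fromhex(sha1_hex)) != 20:
            raise ValueError("a SHA1 must be exactly 40 hex digits")
        # kind, space, lowercase hex SHA1, space, name, NULL terminator
        chunks.append(
            kind.encode("ascii") + b" "
            + sha1_hex.lower().encode("ascii") + b" "
            + name + b"\0"
        )
    return b"".join(chunks)
```

Because the NULL byte terminates each entry and cannot occur in a name, the resulting byte string can be split back into entries unambiguously.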

a-la Git

Snapshot objects will contain an entry for each ref that would exist in a bare repository.

Possible variants:

  • also store symbolic refs in the manifests, in their non-resolved form (i.e., as ref names). To be useful, this needs the guarantee that the refs pointed to by symbolic refs are themselves included in the manifest

Manifest serialization

A snapshot object manifest is a list of entries, named and sorted by ref, where each entry is as follows:

  1. object kind (one of: "blob", "tree", "tag", "commit", "snapshot")
  2. ASCII space
  3. entry name as a sequence of bytes (e.g., "refs/heads/master")
  4. NULL byte "\0"
  5. SHA1 of the target object serialized as 20 raw bytes

Notes:

  • there is no separator between entries
  • the above is inspired by the compact serialization format of tree objects, but other variants are possible:
    • reorder the columns so that entry name comes last and "\0" acts as entry terminator
    • serialize SHA1 as ASCII hex digits instead of binary (it will take more space, but snapshot objects will arguably be far less numerous than tree objects)
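
The base Git-style layout (kind, space, name, NULL byte, 20 raw SHA1 bytes, no separator between entries) can be sketched as follows. As before, the function name serialize_git_manifest and the input tuple shape are hypothetical, and bytewise ordering of ref names is an assumption.

```python
from typing import Iterable, Tuple

# Object kinds listed in the entry description above.
GIT_KINDS = ("blob", "tree", "tag", "commit", "snapshot")


def serialize_git_manifest(entries: Iterable[Tuple[bytes, str, str]]) -> bytes:
    """Serialize (ref name, object kind, hex SHA1) triples, Git-tree style."""
    chunks = []
    # Assumption: "sorted by ref" means bytewise order of the raw names.
    for name, kind, sha1_hex in sorted(entries, key=lambda e: e[0]):
        if kind not in GIT_KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        if b"\0" in name:
            raise ValueError("ref names cannot contain the NULL byte")
        raw = bytes.fromhex(sha1_hex)
        if len(sha1_hex) != 40 or len(raw) != 20:
            raise ValueError("a SHA1 must be exactly 40 hex digits")
        # kind, space, name, NULL byte, then the SHA1 as 20 raw bytes;
        # the fixed SHA1 width is what makes a separator unnecessary.
        chunks.append(kind.encode("ascii") + b" " + name + b"\0" + raw)
    return b"".join(chunks)
```

Each entry costs 20 bytes for the SHA1 instead of the 40 needed by the ASCII hex variant, which is the space trade-off mentioned in the notes.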