<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Dachary</id>
	<title>Software Heritage Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Dachary"/>
	<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/wiki/Special:Contributions/Dachary"/>
	<updated>2026-04-20T15:12:05Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Winery_Object_Storage_Benchmarks&amp;diff=1632</id>
		<title>Winery Object Storage Benchmarks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Winery_Object_Storage_Benchmarks&amp;diff=1632"/>
		<updated>2022-01-25T20:46:30Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Created page with &amp;quot;= Introduction =  An object storage suitable to store over 100 billions immutable small objects [https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_sto...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
An object storage suitable for storing over 100 billion immutable small objects [https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph was designed in March 2021]. An implementation was [https://forge.softwareheritage.org/source/swh-objstorage/browse/master/swh/objstorage/backends/winery/ completed in January 2022] and includes tools to benchmark it. They will be run against the hardware to be delivered in the first half of 2022.&lt;br /&gt;
&lt;br /&gt;
= Methodology =&lt;br /&gt;
&lt;br /&gt;
A [https://forge.softwareheritage.org/source/swh-objstorage/browse/master/swh/objstorage/tests/winery_benchmark.py single process] drives the benchmark on a dedicated machine. It runs multiple write workers, each in a separate process (to avoid any lock contention), which report how many objects and bytes they wrote in a CSV file. It runs multiple read workers in parallel, also in separate processes, because they block on reads from RBD images. The read workers report in a CSV file how many objects they read and how many bytes in total. The objects per second, bytes per second and other relevant measures are stored in files every 60 seconds, to be analyzed when the run completes.&lt;br /&gt;
&lt;br /&gt;
A write worker starts by writing objects in the Write Storage, using a payload obtained from a random source to avoid unintended gain due to easily compressible content. When the Shard is full in the Write Storage, a packer process starts and the benchmark will wait until it completes.&lt;br /&gt;
&lt;br /&gt;
A read worker reads a given number of Object IDs from the Global Index and gets the associated content from the object storage. It has no way to know whether the content will be retrieved from the Read Storage or from the Write Storage. As the Read Storage grows, the probability that the content comes from it grows, since the Write Storage has a fixed size.&lt;br /&gt;
&lt;br /&gt;
The benchmark completes after a given number of seconds and the CSV files are copied from the cluster to the host from which the benchmark is run.&lt;br /&gt;
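The per-worker CSV reporting described above can be sketched as follows; the field names and counters are illustrative, not the actual winery_benchmark.py code.

```python
import csv
import io

def report_stats(writer, worker_id, objects, nbytes, elapsed):
    # Each worker appends one CSV row per reporting window; rates are
    # derived from the raw counters when the run is analyzed.
    writer.writerow([worker_id, objects, nbytes,
                     objects / elapsed, nbytes / elapsed])

buf = io.StringIO()
stats = csv.writer(buf)
stats.writerow(["worker", "objects", "bytes", "objects/s", "bytes/s"])
report_stats(stats, 0, 78000, 1_614_000_000, 60.0)  # one 60-second window
```

Keeping raw counters in the CSV (rather than pre-computed rates) lets the analysis recompute rates over any window after the run completes.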
&lt;br /&gt;
=== Services ===&lt;br /&gt;
&lt;br /&gt;
* Operating System Debian GNU/Linux bullseye&lt;br /&gt;
* Read Storage&lt;br /&gt;
** 4+2 erasure coded pool&lt;br /&gt;
** Ceph pacific https://docs.ceph.com/en/pacific/rados/operations/erasure-code/&lt;br /&gt;
* Write Storage&lt;br /&gt;
** A PostgreSQL master server&lt;br /&gt;
** A PostgreSQL (cold) standby server replicating the master server&lt;br /&gt;
** PostgreSQL 13 https://www.postgresql.org/docs/13/&lt;br /&gt;
&lt;br /&gt;
== Writing ==&lt;br /&gt;
&lt;br /&gt;
There are --winery-bench-rw-workers write worker processes; each is in charge of creating a single Shard. When a worker completes, another one is created immediately.&lt;br /&gt;
&lt;br /&gt;
The size of each Shard is at least --winery-shard-max-size bytes and the exact number of objects it contains depends on the random distribution of the object size.&lt;br /&gt;
&lt;br /&gt;
== Reading ==&lt;br /&gt;
&lt;br /&gt;
There are --winery-bench-ro-workers processes. Each process picks --winery-bench-ro-worker-max-request Object IDs from the Global Index at random, reads their content, verifies it is not None and discards it.&lt;br /&gt;
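The read worker loop can be sketched as below; the `global_index` and `objstorage` dicts are toy stand-ins for the real Global Index and object storage, and the function name is hypothetical.

```python
import random

def read_worker(global_index, objstorage, max_requests, seed=0):
    """Sketch of a read worker: pick Object IDs at random from the Global
    Index, fetch each object, verify the content is present, discard it."""
    rng = random.Random(seed)
    objects = nbytes = 0
    for obj_id in rng.sample(sorted(global_index), max_requests):
        content = objstorage[obj_id]  # stand-in for an objstorage get()
        assert content is not None    # the benchmark only checks presence
        objects += 1
        nbytes += len(content)
    return objects, nbytes            # reported back to the driver

store = {i: b"x" * (i + 1) for i in range(10)}  # toy index + storage in one dict
objects, nbytes = read_worker(store, store, 4)
```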
&lt;br /&gt;
= Developing the benchmark software and report =&lt;br /&gt;
&lt;br /&gt;
The benchmark [https://forge.softwareheritage.org/source/swh-objstorage/browse/master/swh/objstorage/tests/winery_benchmark.py is part of the test suite] so that its code can be conveniently maintained along with it.&lt;br /&gt;
&lt;br /&gt;
It runs in minimal test mode with tox -e winery when the [https://forge.softwareheritage.org/source/swh-objstorage/browse/master/winery-test-environment/README.md test environment] is installed locally with libvirt.&lt;br /&gt;
&lt;br /&gt;
It is also run at scale against [https://www.grid5000.fr/ grid5000] to sanity check issues that do not show on a small cluster. As of January 2022 [https://git.easter-eggs.org/biceps/stats the results] are in the same order of magnitude as what is expected, but the hardware configuration is very different and it cannot be assumed that it will accurately reflect the performance of the target installation.&lt;br /&gt;
&lt;br /&gt;
* Bytes write   26.9 MB/s&lt;br /&gt;
* Objects write 1.3 Kobject/s&lt;br /&gt;
* Bytes read    78.5 MB/s&lt;br /&gt;
* Objects read  3.8 Kobject/s&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
The hardware provided by [https://www.grid5000.fr/ grid5000] consists of:&lt;br /&gt;
&lt;br /&gt;
* A [https://www.grid5000.fr/w/Grenoble:Network Dell S5296F-ON] 10Gbps switch&lt;br /&gt;
* 32 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge C6420]&lt;br /&gt;
** System: 240 GB SSD SATA Samsung MZ7KM240HMHQ0D3&lt;br /&gt;
** Storage: 4.0 TB HDD SATA Seagate ST4000NM0265-2DC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 2 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 192 GiB RAM&lt;br /&gt;
** One 10Gbps link to the switch&lt;br /&gt;
* 2 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge R940]&lt;br /&gt;
** System: 480 GB SSD SATA Intel SSDSC2KG480G7R&lt;br /&gt;
** Storage: 2 x 1.6 TB SSD NVME Dell Express Flash NVMe PM1725 1.6TB AIC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 4 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 768 GiB RAM&lt;br /&gt;
** One 10Gbps link to the switch&lt;br /&gt;
&lt;br /&gt;
Note: The machines have additional resources (network, disk) but only those used in the context of the benchmark are listed.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Premilinary_Object_Storage_Benchmarks&amp;diff=1612</id>
		<title>Premilinary Object Storage Benchmarks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Premilinary_Object_Storage_Benchmarks&amp;diff=1612"/>
		<updated>2021-09-16T13:39:04Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Dachary moved page Premilinary Object Storage Benchmarks to Preliminary Object Storage Benchmarks: Typo in the title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Preliminary Object Storage Benchmarks]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Preliminary_Object_Storage_Benchmarks&amp;diff=1611</id>
		<title>Preliminary Object Storage Benchmarks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Preliminary_Object_Storage_Benchmarks&amp;diff=1611"/>
		<updated>2021-09-16T13:39:04Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Dachary moved page Premilinary Object Storage Benchmarks to Preliminary Object Storage Benchmarks: Typo in the title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
An object storage suitable for storing over 100 billion immutable small objects [https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph was designed in March 2021]. A benchmark was [https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-08-06 written] and [https://forge.softwareheritage.org/T3149 tested] to demonstrate the design could deliver the expected performance.&lt;br /&gt;
&lt;br /&gt;
Based on the results presented in this document, hardware procurement started in July 2021. The Software Heritage [https://docs.softwareheritage.org/ codebase] is going to be modified to implement the proposed object storage, starting August 2021. The benchmarks will be refactored to use the actual codebase and run on a regular basis to verify it performs as expected.&lt;br /&gt;
&lt;br /&gt;
= Benchmark results =&lt;br /&gt;
&lt;br /&gt;
For the duration of [https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-08-06 the benchmark] (5 days), over 5,000 objects per second (~100MB/s) were written to object storage.&lt;br /&gt;
&lt;br /&gt;
[[File:1.png]]&lt;br /&gt;
&lt;br /&gt;
Simultaneously, over 23,000 objects per second (~100MB/s) were read, either sequentially or randomly.&lt;br /&gt;
&lt;br /&gt;
[[File:2.png]]&lt;br /&gt;
&lt;br /&gt;
The time to randomly read a 4KB object is lower than 100ms 98% of the time. The graph below shows that the worst access time never exceeds 3 seconds over the entire duration of the benchmark.&lt;br /&gt;
&lt;br /&gt;
[[File:3.png]]&lt;br /&gt;
&lt;br /&gt;
= On read performance degradation =&lt;br /&gt;
&lt;br /&gt;
The results show that the read performance degrades over time. This was discovered during the last run (August 2021) of the benchmark, which lasted five days. All previous runs (between April and July 2021) lasted less than three days and this degradation was not detected.&lt;br /&gt;
&lt;br /&gt;
A [https://forge.softwareheritage.org/T3422#68638 preliminary analysis] suggests the root problem is not the Ceph cluster. The benchmark software does not implement throttling for reads and this could be the cause of the problem.&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
The hardware provided by [https://www.grid5000.fr/ grid5000] consists of:&lt;br /&gt;
&lt;br /&gt;
* A [https://www.grid5000.fr/w/Grenoble:Network Dell S5296F-ON] 10Gbps switch&lt;br /&gt;
* 32 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge C6420]&lt;br /&gt;
** System: 240 GB SSD SATA Samsung MZ7KM240HMHQ0D3&lt;br /&gt;
** Storage: 4.0 TB HDD SATA Seagate ST4000NM0265-2DC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 2 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 192 GiB RAM&lt;br /&gt;
** One 10Gbps link to the switch&lt;br /&gt;
* 2 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge R940]&lt;br /&gt;
** System: 480 GB SSD SATA Intel SSDSC2KG480G7R&lt;br /&gt;
** Storage: 2 x 1.6 TB SSD NVME Dell Express Flash NVMe PM1725 1.6TB AIC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 4 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 768 GiB RAM&lt;br /&gt;
** One 10Gbps link to the switch&lt;br /&gt;
&lt;br /&gt;
Note: The machines have additional resources (network, disk) but only those used in the context of the benchmark are listed.&lt;br /&gt;
&lt;br /&gt;
= Methodology =&lt;br /&gt;
&lt;br /&gt;
A single process drives the benchmark on a dedicated machine. It runs multiple write workers, each in a separate process (to avoid any lock contention), which report how many objects and bytes they wrote when terminating. It runs multiple read workers in parallel, also in separate processes, because they block on reads from RBD images. The read workers report how many objects they read and how many bytes in total. The objects per second, bytes per second and other relevant measures are stored in files every 5 seconds, to be analyzed when the run completes.&lt;br /&gt;
&lt;br /&gt;
A write worker starts by writing objects in the Write Storage, using a payload obtained from a random source to avoid unintended gain due to easily compressible content. The Global Index is updated by adding a new entry for every object written in the Write Storage. The Global Index is pre-filled before the benchmark starts running to better reflect a realistic workload.&lt;br /&gt;
&lt;br /&gt;
When the Shard is full in the Write Storage, a second phase starts and it is moved to the Read Storage. The content of the Shard is read from the database twice: once for building the index and another to obtain the content. The index and the content of the Shard are written to the RBD image in chunks of 4MB. Once finished, the Shard is deleted from the Write Storage.&lt;br /&gt;
&lt;br /&gt;
A read worker opens a single Shard and reads objects either:&lt;br /&gt;
&lt;br /&gt;
* at random or,&lt;br /&gt;
* sequentially from a random point in the Shard&lt;br /&gt;
&lt;br /&gt;
In both cases it stops after reading a few megabytes to keep the processes short lived and to rotate over as many Shards as possible, so that the content of the Shard is unlikely to be found in the cache.&lt;br /&gt;
&lt;br /&gt;
The time it takes to read an object at random is measured and stored in a file when it exceeds 100ms. It is interpreted as the time to first byte although it really is the time to get the entire object. However, since the objects are small, the difference is considered marginal.&lt;br /&gt;
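The measurement described above can be sketched as follows; the helper name and the log structure are illustrative, not the benchmark's actual code.

```python
import time

SLOW_THRESHOLD = 0.1  # 100 ms

def timed_read(read_fn, obj_id, slow_log):
    """Measure one random read and record it only when it exceeds the
    threshold; the duration approximates time-to-first-byte since the
    objects are small."""
    start = time.monotonic()
    content = read_fn(obj_id)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD:
        slow_log.append((obj_id, elapsed))
    return content

slow = []
timed_read(lambda oid: b"tiny", 1, slow)                    # fast: not recorded
timed_read(lambda oid: time.sleep(0.15) or b"x", 2, slow)   # slow: recorded
```

Recording only the slow reads keeps the log small over a multi-day run while still capturing the tail latency that matters for the 100ms target.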
&lt;br /&gt;
The writers compete with the readers: without any readers, the writers perform better (objects per second and bytes per second). The number of read workers is set (by trial and error) to deliver the desired performance (more workers means better performance) without hindering the writers.&lt;br /&gt;
&lt;br /&gt;
== Assumptions ==&lt;br /&gt;
&lt;br /&gt;
* The creation of the perfect hash table for a given Shard is not benchmarked and is assumed to be negligible. It is however expensive when implemented in Python and must be implemented in C or a similarly fast compiled language.&lt;br /&gt;
* The majority of reads are sequential (for mirroring and analysis) and a minority are random (from the API)&lt;br /&gt;
* Large objects (i.e. &amp;amp;gt; 80KB) are not taken into consideration. The small objects are the only ones causing problems, and having larger objects would improve the benchmarks.&lt;br /&gt;
* There are no huge objects because their size is capped at 100MB.&lt;br /&gt;
* The random source is a file pre-filled from /dev/urandom, because reading /dev/urandom directly is too slow (~200MB/s)&lt;br /&gt;
* Reads only target the Read Storage although they may also target the Write Storage in a production environment. As the cluster grows, the number of objects in the Read Storage will grow while the number of objects in the Write Storage remains constant. As a consequence, the number of reads served by the Write Storage will become smaller over time.&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;br /&gt;
&lt;br /&gt;
* Operating System Debian GNU/Linux buster&lt;br /&gt;
* Read Storage&lt;br /&gt;
** 4+2 erasure coded pool&lt;br /&gt;
** Ceph Octopus https://docs.ceph.com/en/octopus/rados/operations/erasure-code/&lt;br /&gt;
* Write Storage&lt;br /&gt;
** A PostgreSQL master server&lt;br /&gt;
** A PostgreSQL (cold) standby server replicating the master server&lt;br /&gt;
** PostgreSQL 11 https://www.postgresql.org/docs/11/&lt;br /&gt;
&lt;br /&gt;
== Writing ==&lt;br /&gt;
&lt;br /&gt;
There are --rw-workers processes; each of them is in charge of creating a single Shard. When a worker completes, another one is created immediately. The benchmark stops after --file-count-ro Shards have been created.&lt;br /&gt;
&lt;br /&gt;
The size of each Shard is at least --file-size MB and the exact number of objects it contains depends on the random distribution of the object size.&lt;br /&gt;
&lt;br /&gt;
* Creating the Shard in Write Storage&lt;br /&gt;
** Create a database named after a random UUID&lt;br /&gt;
** Add objects to the database, up to --file-size MB&lt;br /&gt;
** The size distribution of the objects in the database mimics the distribution described in the object storage design&lt;br /&gt;
** The content of the objects is random&lt;br /&gt;
* When a Shard is full in the Write Storage&lt;br /&gt;
** Create a RBD image named after the UUID and mount it&lt;br /&gt;
** Iterate over the content of the Shard&lt;br /&gt;
*** Write an index at the beginning of the RBD image&lt;br /&gt;
*** Write the content of each object after the index&lt;br /&gt;
** Remount the RBD image readonly&lt;br /&gt;
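The index-then-content layout in the steps above can be sketched as a small packer; the 32-byte hash and 8-byte big-endian offset fields are assumptions for illustration, not the exact on-disk format.

```python
import struct

def pack_shard(objects):
    """Pack a mapping {hash: content} into one byte string laid out like
    the RBD image described above: an index first, then the object
    contents. Each index entry is a 32-byte hash plus an 8-byte offset
    pointing into the content area."""
    index_size = len(objects) * (32 + 8)
    entries, payload = [], bytearray()
    for obj_hash in sorted(objects):
        entries.append(struct.pack(">32sQ", obj_hash, index_size + len(payload)))
        payload.extend(objects[obj_hash])
    return b"".join(entries) + bytes(payload)

shard = pack_shard({b"a" * 32: b"hello", b"b" * 32: b"world"})
```

With the index at a known position and of known size, a reader can locate any object with a single seek into the content area.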
&lt;br /&gt;
== Reading ==&lt;br /&gt;
&lt;br /&gt;
There are --ro-workers processes. Each process picks a Shard from the Read Storage at random and a workload (sequential or random).&lt;br /&gt;
&lt;br /&gt;
* Random workload&lt;br /&gt;
** Seek to a random position in the Shard&lt;br /&gt;
** Read an object&lt;br /&gt;
** The size distribution of the objects in the database mimics the distribution described in the object storage design&lt;br /&gt;
* Sequential workload&lt;br /&gt;
** Seek to a random position in the Shard&lt;br /&gt;
** Read 4MB objects sequentially&lt;br /&gt;
** Assume objects have a size of 4KB (the median object size)&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1607</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1607"/>
		<updated>2021-08-31T08:31:29Z</updated>

		<summary type="html">&lt;p&gt;Dachary: EOS is not based on Ceph&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project's mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021 it contains 10 billion unique source code files (or “objects”, in the following) totaling ~750TB of (uncompressed) data and grows by 50TB every month. 75% of these objects have a size smaller than 16KB and 50% have a size smaller than 4KB. But these small objects only account for ~5% of the 750TB: the 25% of objects with a size &amp;amp;gt; 16KB occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance for '''10PB''' and '''100 billion objects''' is as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is a significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time and not in bulk which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a number of machines (the Write Storage) that can be adjusted to control the write throughput. When a threshold is reached (e.g. 100GB), all objects are put together in a container (a Shard) and moved to a readonly storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client. It can be used to read the object back from the readonly storage. Reads scale out because the unique identifiers of the objects embed the name of the container (the Shard UUID). Writes also scale out because the Database is chosen randomly. This is Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can rely on an API backed by an index mapping all known object signatures (the Object HASH below) to the name of the container where they can be found. Although this index prevents scaling out writes, the readonly storage can still scale out by multiplying copies of the index as needed. This is Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Ceph-objstorage-sw-architecture.svg]]&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two different storage areas:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookups require a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs. The Object HASH will be the same because it is based on the content of the Object. But the Shard in which the Object is stored may be different since it is chosen at random.&lt;br /&gt;
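The write path described above can be sketched as below; plain dicts stand in for the PostgreSQL databases, and the class and method names are hypothetical.

```python
import hashlib
import random
import uuid

class WriteStorage:
    """Layer-0 write path sketch: each Database is bound to one Shard
    UUID and the Database is chosen at random, so writing the same
    content twice may return different Shard UUIDs (but always the
    same Object HASH, since it is derived from the content)."""
    def __init__(self, n_databases, seed=None):
        self.rng = random.Random(seed)
        self.databases = {uuid.uuid4(): {} for _ in range(n_databases)}

    def write(self, content):
        obj_hash = hashlib.sha256(content).digest()
        shard_uuid = self.rng.choice(sorted(self.databases, key=str))
        self.databases[shard_uuid][obj_hash] = content
        return (obj_hash, shard_uuid)  # the Object ID

storage = WriteStorage(3, seed=0)
id1 = storage.write(b"same content")
id2 = storage.write(b"same content")
```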
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If the Shard exists in the Read Storage, the Object HASH is used to look up the content of the Object. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content of the Object. If the reader is not interested in the most up to date content, it can limit its search to the Read Storage.&lt;br /&gt;
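The read path can be sketched as follows; plain dicts stand in for the RBD images and PostgreSQL databases, and the function name is illustrative.

```python
def read_object(object_id, read_storage, write_storage):
    """Layer-0 read sketch: the Shard UUID embedded in the Object ID
    determines where to look. The Read Storage is checked first; a
    Shard still being filled is served by its Database in the Write
    Storage."""
    obj_hash, shard_uuid = object_id
    if shard_uuid in read_storage:              # already packed Shard
        return read_storage[shard_uuid][obj_hash]
    return write_storage[shard_uuid][obj_hash]  # Shard still in a Database

read_storage = {"shard-1": {b"h1": b"packed object"}}
write_storage = {"shard-2": {b"h2": b"fresh object"}}
```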
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage; they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index: each is mapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the Shard found in the Read Storage. Otherwise look up the Object HASH in the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending each Object after another in a file), and since the Object HASH has a fixed size, the only per-Object overhead is the field storing the size of the Object (8 bytes). Handling the Shard containing the Objects as a single 100GB Ceph RBD Image adds R bytes per image. If the underlying Ceph pool is erasure coded k=4,m=2, an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal perfect hash table] to the Shard so that finding an Object is O(1) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per Object.&lt;br /&gt;
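A toy version of this layout, where each Object is appended with its 8-byte size field and a Python dict stands in for the minimal perfect hash table mapping Object HASH to offset:&lt;br /&gt;

```python
import struct

def pack(objects):
    """Append each Object as an 8-byte size field followed by its content."""
    data, index, offset = bytearray(), {}, 0
    for obj_hash, content in objects.items():
        index[obj_hash] = offset
        record = struct.pack("!Q", len(content)) + content
        data += record
        offset += len(record)
    return bytes(data), index

def lookup(shard, index, obj_hash):
    offset = index[obj_hash]  # O(1) with a perfect hash table
    (size,) = struct.unpack_from("!Q", shard, offset)
    return shard[offset + 8 : offset + 8 + size]
```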
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects that are smaller than a given threshold will be amplified, depending on the underlying storage. For instance all objects in Ceph have a minimum size of 4KB, therefore the size of a 1KB Object will be amplified to 4KB, which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Another example is MinIO with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification], or [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift] for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes, with M = 100 billion and S = 10PB, is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 100,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
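As a cross-check, the totals above can be reproduced with integer arithmetic; the per-image R term is left out of the sums since R is unspecified:&lt;br /&gt;

```python
GB, TB, PB = 10**9, 10**12, 10**15

M = 100 * 10**9  # number of Objects
S = 10 * PB      # total payload in bytes

n_images = S // (100 * GB)  # one 100GB RBD image per Shard
index_bytes = M * 24        # 24 bytes of index overhead per Object

# Packed: payload + index + 50% durability (plus n_images * R, unspecified).
packed = S + index_bytes + S // 2
# Not packed: payload + 25% space amplification + 50% durability.
not_packed = S + S // 4 + S // 2
```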
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
For mirroring or running an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million Objects or more per Shard), the reader can download an entire Shard instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects via individual lookups requires 10 million seconds, which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
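The enumeration arithmetic, written out with the stated 10ms lookup time and 100MB/s Shard read rate:&lt;br /&gt;

```python
objects = 10**9   # one billion objects
lookup_ms = 10    # per individual lookup

individual_s = objects * lookup_ms // 1000  # 10 million seconds
individual_days = individual_s // 86400     # over 100 days

total_bytes = 75 * 10**12   # ~75TB of content for those objects
rate = 100 * 10**6          # Shards read at 100MB/s
streamed_s = total_bytes // rate
streamed_days = streamed_s // 86400         # under 10 days
```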
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions, all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more], and most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained, either including the following features or with the possibility for a third party to contribute them back to the project:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Have a sound approach to addressing '''security''' problems (CVEs etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! SeaweedFS&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases and testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and the number of objects in Software Heritage are similar to what is described in the 2010 article &amp;ldquo;[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook&amp;rsquo;s photo storage]&amp;rdquo;, which motivated the implementation of [https://github.com/chrislusf/seaweedfs SeaweedFS] in 2013 and of [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects.&lt;br /&gt;
&lt;br /&gt;
Contributing to SeaweedFS or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide to deploy at scale&lt;br /&gt;
* Discussing with upstream the creation of stable releases, defining their lifecycle and organizing release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up the infrastructure and creating the software for integration testing, to be run before a stable release is published to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discussing with upstream to validate the blueprint&lt;br /&gt;
* Implementing the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the effort required should be compared. The comparison is difficult, however, because the nature of the effort differs from one solution to another. The following factors were considered and aggregated in a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the work done implies the risk of losing data, it makes the work significantly more complicated. It is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not have integration testing to verify upgrading to a newer version won’t lead to a regression, which is the case with Ambry. It is likely that the Ambry upstream has extensive integration testing but they are not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment is familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! From scratch&lt;br /&gt;
! SeaweedFS&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Python&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing a system from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Preliminary_Object_Storage_Benchmarks&amp;diff=1606</id>
		<title>Preliminary Object Storage Benchmarks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Preliminary_Object_Storage_Benchmarks&amp;diff=1606"/>
		<updated>2021-08-30T10:31:43Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Add category&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
An object storage suitable to store over 100 billion immutable small objects [https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph was designed in March 2021]. A benchmark was [https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-08-06 written] and [https://forge.softwareheritage.org/T3149 tested] to demonstrate the design could deliver the expected performance.&lt;br /&gt;
&lt;br /&gt;
Based on the results presented in this document, hardware procurement started in July 2021. The Software Heritage [https://docs.softwareheritage.org/ codebase] is going to be modified to implement the proposed object storage, starting August 2021. The benchmarks will be refactored to use the actual codebase and run on a regular basis to verify it performs as expected.&lt;br /&gt;
&lt;br /&gt;
= Benchmark results =&lt;br /&gt;
&lt;br /&gt;
For the duration of [https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-08-06 the benchmark] (5 days), over 5,000 objects per second (~100MB/s) were written to object storage.&lt;br /&gt;
&lt;br /&gt;
[[File:1.png]]&lt;br /&gt;
&lt;br /&gt;
Simultaneously, over 23,000 objects per second (~100MB/s) were read, either sequentially or randomly.&lt;br /&gt;
&lt;br /&gt;
[[File:2.png]]&lt;br /&gt;
&lt;br /&gt;
The time to randomly read a 4KB object is lower than 100ms 98% of the time. The graph below shows that the worst access time never exceeds 3 seconds over the entire duration of the benchmark.&lt;br /&gt;
&lt;br /&gt;
[[File:3.png]]&lt;br /&gt;
&lt;br /&gt;
= On read performance degradation =&lt;br /&gt;
&lt;br /&gt;
The results show that read performance degrades over time. This was discovered during the last run (August 2021) of the benchmark, which lasted five days. All previous runs (between April and July 2021) lasted less than three days and this degradation was not detected.&lt;br /&gt;
&lt;br /&gt;
A [https://forge.softwareheritage.org/T3422#68638 preliminary analysis] suggests the root problem is not the Ceph cluster. The benchmark software does not implement throttling for reads and this could be the cause of the problem.&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
The hardware provided by [https://www.grid5000.fr/ grid5000] consists of:&lt;br /&gt;
&lt;br /&gt;
* A [https://www.grid5000.fr/w/Grenoble:Network Dell S5296F-ON] 10GB switch&lt;br /&gt;
* 32 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge C6420]&lt;br /&gt;
** System: 240 GB SSD SATA Samsung MZ7KM240HMHQ0D3&lt;br /&gt;
** Storage: 4.0 TB HDD SATA Seagate ST4000NM0265-2DC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 2 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 192 GiB RAM&lt;br /&gt;
** One 10GB link to the switch&lt;br /&gt;
* 2 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge R940]&lt;br /&gt;
** System: 480 GB SSD SATA Intel SSDSC2KG480G7R&lt;br /&gt;
** Storage: 2 x 1.6 TB SSD NVME Dell Dell Express Flash NVMe PM1725 1.6TB AIC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 4 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 768 GiB RAM&lt;br /&gt;
** One 10GB link to the switch&lt;br /&gt;
&lt;br /&gt;
Note: The machines have additional resources (network, disk) but only those used in the context of the benchmark are listed.&lt;br /&gt;
&lt;br /&gt;
= Methodology =&lt;br /&gt;
&lt;br /&gt;
A single process drives the benchmark on a dedicated machine. It runs multiple write workers, each in a separate process (to avoid any lock contention), which report how many objects and bytes they wrote when terminating. It runs multiple read workers in parallel, also in separate processes, because they are blocking on reads from RBD images. The read workers report how many objects they read and how many bytes in total. Objects per second, bytes per second and other relevant measures are stored in files every 5 seconds, to be analyzed when the run completes.&lt;br /&gt;
&lt;br /&gt;
A write worker starts by writing objects in the Write Storage, using a payload obtained from a random source to avoid unintended gain due to easily compressible content. The Global Index is updated by adding a new entry for every object written in the Write Storage. The Global Index is pre-filled before the benchmark starts running to better reflect a realistic workload.&lt;br /&gt;
&lt;br /&gt;
When the Shard is full in the Write Storage, a second phase starts and it is moved to the Read Storage. The content of the Shard is read from the database twice: once to build the index and a second time to obtain the content. The index and the content of the Shard are written to the RBD image in chunks of 4MB. Once finished, the Shard is deleted from the Write Storage.&lt;br /&gt;
&lt;br /&gt;
A read worker opens a single Shard and reads objects either:&lt;br /&gt;
&lt;br /&gt;
* at random or,&lt;br /&gt;
* sequentially from a random point in the Shard&lt;br /&gt;
&lt;br /&gt;
In both cases it stops after reading a few megabytes to keep the processes short lived and rotate over as many Shards as possible, so that the content of the Shard is unlikely to be found in the cache.&lt;br /&gt;
&lt;br /&gt;
The time it takes to read an object at random is measured and stored in a file when it exceeds 100ms. It is interpreted as the time to first byte although it really is the time to get the entire object. However, since the objects are small, the difference is considered marginal.&lt;br /&gt;
&lt;br /&gt;
The writers compete with the readers: without any reader, the writers will perform better (objects per second and bytes per second). The number of read workers is set (by trial and error) to deliver the desired performance (more workers means better performance) without hindering the writers.&lt;br /&gt;
&lt;br /&gt;
== Assumptions ==&lt;br /&gt;
&lt;br /&gt;
* The creation of the perfect hash table for a given Shard is not benchmarked and assumed to be negligible. It is however expensive when implemented in Python and must be implemented in C or a similar compiled language.&lt;br /&gt;
* The majority of reads are sequential (for mirroring and analysis) and a minority are random (from the API)&lt;br /&gt;
* The large objects (i.e. &amp;amp;gt; 80KB) are not taken into consideration. The small objects are the only ones causing problems and having larger objects would improve the benchmarks.&lt;br /&gt;
* There are no huge objects because their size is capped at 100MB.&lt;br /&gt;
* The random source is a file created from /dev/urandom, because reading /dev/urandom directly is too slow (~200MB/s).&lt;br /&gt;
* Reads only target the Read Storage although they may also target the Write Storage in a production environment. As the cluster grows, the number of objects in the Read Storage will grow while the number of objects in the Write Storage will remain constant. As a consequence the number of reads served by the Write Storage will become smaller over time.&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;br /&gt;
&lt;br /&gt;
* Operating System Debian GNU/Linux buster&lt;br /&gt;
* Read Storage&lt;br /&gt;
** 4+2 erasure coded pool&lt;br /&gt;
** Ceph Octopus https://docs.ceph.com/en/octopus/rados/operations/erasure-code/&lt;br /&gt;
* Write Storage&lt;br /&gt;
** A PostgreSQL master server&lt;br /&gt;
** A PostgreSQL (cold) standby server replicating the master server&lt;br /&gt;
** PostgreSQL 11 https://www.postgresql.org/docs/11/&lt;br /&gt;
&lt;br /&gt;
== Writing ==&lt;br /&gt;
&lt;br /&gt;
There are --rw-workers processes, each of them in charge of creating a single Shard. When a worker completes, another one is created immediately. The benchmark stops after --file-count-ro Shards have been created.&lt;br /&gt;
&lt;br /&gt;
The size of each Shard is at least --file-size MB and the exact number of objects it contains depends on the random distribution of the object size.&lt;br /&gt;
&lt;br /&gt;
* Creating the Shard in Write Storage&lt;br /&gt;
** Create a database named after a random UUID&lt;br /&gt;
** Add objects to the database, up to --file-size MB&lt;br /&gt;
** The size distribution of the objects in the database mimics the distribution described in the object storage design&lt;br /&gt;
** The content of the objects is random&lt;br /&gt;
* When a Shard is full in the Write Storage&lt;br /&gt;
** Create a RBD image named after the UUID and mount it&lt;br /&gt;
** Iterate over the content of the Shard&lt;br /&gt;
*** Write an index at the beginning of the RBD image&lt;br /&gt;
*** Write the content of each object after the index&lt;br /&gt;
** Remount the RBD image readonly&lt;br /&gt;
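The two phases above can be sketched with a dict standing in for the PostgreSQL Database and an in-memory buffer standing in for the RBD image; the names and the fixed-size index layout (32-byte hash plus 8-byte offset) are illustrative assumptions:&lt;br /&gt;

```python
import io
import struct
import uuid

def fill_shard(object_source, max_bytes):
    """Phase 1: accumulate objects in a Database named after a fresh UUID."""
    shard_uuid, database, total = uuid.uuid4(), {}, 0
    for obj_hash, content in object_source:
        database[obj_hash] = content
        total += len(content)
        if total >= max_bytes:  # the Shard is full
            break
    return shard_uuid, database

def pack_shard(database):
    """Phase 2: write the index at the start of the image, then the objects."""
    image = io.BytesIO()
    header_size = len(database) * 40  # 32-byte hash + 8-byte offset per entry
    offset, index = 0, []
    # First pass over the Shard content: compute each object's offset.
    for obj_hash, content in database.items():
        index.append((obj_hash, header_size + offset))
        offset += len(content)
    for obj_hash, obj_offset in index:
        image.write(obj_hash + struct.pack("!Q", obj_offset))
    # Second pass: append the content of each object after the index.
    for content in database.values():
        image.write(content)
    return image.getvalue()
```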
&lt;br /&gt;
== Reading ==&lt;br /&gt;
&lt;br /&gt;
There are --ro-workers processes. Each process picks a Shard from the Read Storage at random and a workload (sequential or random).&lt;br /&gt;
&lt;br /&gt;
* Random workload&lt;br /&gt;
** Seek to a random position in the Shard&lt;br /&gt;
** Read an object&lt;br /&gt;
** The size distribution of the objects in the database mimics the distribution described in the object storage design&lt;br /&gt;
* Sequential workload&lt;br /&gt;
** Seek to a random position in the Shard&lt;br /&gt;
** Read 4MB objects sequentially&lt;br /&gt;
** Assume objects have a size of 4KB (the median object size)&lt;br /&gt;
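The two workloads can be sketched as follows, with a Shard reduced to raw bytes plus a list of (offset, size) records; all names are illustrative:&lt;br /&gt;

```python
import random

def random_workload(shard_bytes, records, rng, count=10):
    """Seek to random objects and read them one by one."""
    reads = []
    for _ in range(count):
        offset, size = rng.choice(records)
        reads.append(shard_bytes[offset : offset + size])
    return reads

def sequential_workload(shard_bytes, records, rng, budget=4 * 1024 * 1024):
    """Start at a random object and read forward until ~budget bytes were read."""
    start = rng.randrange(len(records))
    reads, consumed = [], 0
    for offset, size in records[start:]:
        reads.append(shard_bytes[offset : offset + size])
        consumed += size
        if consumed >= budget:
            break
    return reads
```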
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Preliminary_Object_Storage_Benchmarks&amp;diff=1605</id>
		<title>Preliminary Object Storage Benchmarks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Preliminary_Object_Storage_Benchmarks&amp;diff=1605"/>
		<updated>2021-08-30T10:30:24Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Created page with &amp;quot;= Introduction =  An object storage suitable to store over 100 billions immutable small objects [https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_sto...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
An object storage suitable to store over 100 billion immutable small objects [https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph was designed in March 2021]. A benchmark was [https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-08-06 written] and [https://forge.softwareheritage.org/T3149 tested] to demonstrate the design could deliver the expected performance.&lt;br /&gt;
&lt;br /&gt;
Based on the results presented in this document, hardware procurement started in July 2021. The Software Heritage [https://docs.softwareheritage.org/ codebase] is going to be modified to implement the proposed object storage, starting August 2021. The benchmarks will be refactored to use the actual codebase and run on a regular basis to verify it performs as expected.&lt;br /&gt;
&lt;br /&gt;
= Benchmark results =&lt;br /&gt;
&lt;br /&gt;
For the duration of [https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-08-06 the benchmark] (5 days), over 5,000 objects per second (~100MB/s) were written to object storage.&lt;br /&gt;
&lt;br /&gt;
[[File:1.png]]&lt;br /&gt;
&lt;br /&gt;
Simultaneously, over 23,000 objects per second (~100MB/s) were read, either sequentially or randomly.&lt;br /&gt;
&lt;br /&gt;
[[File:2.png]]&lt;br /&gt;
&lt;br /&gt;
The time to randomly read a 4KB object is lower than 100ms 98% of the time. The graph below shows that the worst access time never exceeds 3 seconds over the entire duration of the benchmark.&lt;br /&gt;
&lt;br /&gt;
[[File:3.png]]&lt;br /&gt;
&lt;br /&gt;
= On read performance degradation =&lt;br /&gt;
&lt;br /&gt;
The results show that read performance degrades over time. This was discovered during the last run (August 2021) of the benchmark, which lasted five days. All previous runs (between April and July 2021) lasted less than three days and this degradation was not detected.&lt;br /&gt;
&lt;br /&gt;
A [https://forge.softwareheritage.org/T3422#68638 preliminary analysis] suggests the root problem is not the Ceph cluster. The benchmark software does not implement throttling for reads and this could be the cause of the problem.&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
The hardware provided by [https://www.grid5000.fr/ grid5000] consists of:&lt;br /&gt;
&lt;br /&gt;
* A [https://www.grid5000.fr/w/Grenoble:Network Dell S5296F-ON] 10GB switch&lt;br /&gt;
* 32 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge C6420]&lt;br /&gt;
** System: 240 GB SSD SATA Samsung MZ7KM240HMHQ0D3&lt;br /&gt;
** Storage: 4.0 TB HDD SATA Seagate ST4000NM0265-2DC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 2 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 192 GiB RAM&lt;br /&gt;
** One 10GB link to the switch&lt;br /&gt;
* 2 [https://www.grid5000.fr/w/Grenoble:Hardware#dahu Dell PowerEdge R940]&lt;br /&gt;
** System: 480 GB SSD SATA Intel SSDSC2KG480G7R&lt;br /&gt;
** Storage: 2 x 1.6 TB SSD NVME Dell Dell Express Flash NVMe PM1725 1.6TB AIC&lt;br /&gt;
** Intel Xeon Gold 6130 (Skylake, 2.10GHz, 4 CPUs/node, 16 cores/CPU)&lt;br /&gt;
** 768 GiB RAM&lt;br /&gt;
** One 10GB link to the switch&lt;br /&gt;
&lt;br /&gt;
Note: The machines have additional resources (network, disk) but only those used in the context of the benchmark are listed.&lt;br /&gt;
&lt;br /&gt;
= Methodology =&lt;br /&gt;
&lt;br /&gt;
A single process drives the benchmark on a dedicated machine. It runs multiple write workers, each in a separate process (to avoid any lock contention), which report how many objects and bytes they wrote when terminating. It runs multiple read workers in parallel, also in separate processes, because they are blocking on reads from RBD images. The read workers report how many objects they read and how many bytes in total. Objects per second, bytes per second and other relevant measures are stored in files every 5 seconds, to be analyzed when the run completes.&lt;br /&gt;
&lt;br /&gt;
A write worker starts by writing objects in the Write Storage, using a payload obtained from a random source to avoid unintended gain due to easily compressible content. The Global Index is updated by adding a new entry for every object written in the Write Storage. The Global Index is pre-filled before the benchmark starts running to better reflect a realistic workload.&lt;br /&gt;
&lt;br /&gt;
When the Shard is full in the Write Storage, a second phase starts and it is moved to the Read Storage. The content of the Shard is read from the database twice: once to build the index and a second time to obtain the content. The index and the content of the Shard are written to the RBD image in chunks of 4MB. Once finished, the Shard is deleted from the Write Storage.&lt;br /&gt;
&lt;br /&gt;
A read worker opens a single Shard and reads objects either:&lt;br /&gt;
&lt;br /&gt;
* at random or,&lt;br /&gt;
* sequentially from a random point in the Shard&lt;br /&gt;
&lt;br /&gt;
In both cases it stops after reading a few megabytes to keep the processes short lived and rotate over as many Shards as possible, so that the content of the Shard is unlikely to be found in the cache.&lt;br /&gt;
&lt;br /&gt;
The time it takes to read an object at random is measured and stored in a file when it exceeds 100ms. It is interpreted as the time to first byte although it really is the time to get the entire object. However, since the objects are small, the difference is considered marginal.&lt;br /&gt;
&lt;br /&gt;
The writers compete with the readers: without any reader, the writers will perform better (objects per second and bytes per second). The number of read workers is set (by trial and error) to deliver the desired performance (more workers means better performance) without hindering the writers.&lt;br /&gt;
&lt;br /&gt;
== Assumptions ==&lt;br /&gt;
&lt;br /&gt;
* The creation of the perfect hash table for a given Shard is not benchmarked and assumed to be negligible. It is however expensive when implemented in Python and must be implemented in C or a similar compiled language.&lt;br /&gt;
* The majority of reads are sequential (for mirroring and analysis) and a minority are random (from the API)&lt;br /&gt;
* The large objects (i.e. &amp;amp;gt; 80KB) are not taken into consideration. The small objects are the only ones causing problems and having larger objects would improve the benchmarks.&lt;br /&gt;
* There are no huge objects because their size is capped at 100MB.&lt;br /&gt;
* The random source is a file created from /dev/urandom, because reading /dev/urandom directly is too slow (~200MB/s).&lt;br /&gt;
* Reads only target the Read Storage although they may also target the Write Storage in a production environment. As the cluster grows, the number of objects in the Read Storage will grow while the number of objects in the Write Storage will remain constant. As a consequence the number of reads served by the Write Storage will become smaller over time.&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;br /&gt;
&lt;br /&gt;
* Operating System Debian GNU/Linux buster&lt;br /&gt;
* Read Storage&lt;br /&gt;
** 4+2 erasure coded pool&lt;br /&gt;
** Ceph Octopus https://docs.ceph.com/en/octopus/rados/operations/erasure-code/&lt;br /&gt;
* Write Storage&lt;br /&gt;
** A PostgreSQL master server&lt;br /&gt;
** A PostgreSQL (cold) standby server replicating the master server&lt;br /&gt;
** PostgreSQL 11 https://www.postgresql.org/docs/11/&lt;br /&gt;
&lt;br /&gt;
== Writing ==&lt;br /&gt;
&lt;br /&gt;
There are --rw-workers processes, each of them in charge of creating a single Shard. When a worker completes, another one is created immediately. The benchmark stops after --file-count-ro Shards have been created.&lt;br /&gt;
&lt;br /&gt;
The size of each Shard is at least --file-size MB and the exact number of objects it contains depends on the random distribution of the object size.&lt;br /&gt;
&lt;br /&gt;
* Creating the Shard in Write Storage&lt;br /&gt;
** Create a database named after a random UUID&lt;br /&gt;
** Add objects to the database, up to --file-size MB&lt;br /&gt;
** The size distribution of the objects in the database mimics the distribution described in the object storage design&lt;br /&gt;
** The content of the objects is random&lt;br /&gt;
* When a Shard is full in the Write Storage&lt;br /&gt;
** Create a RBD image named after the UUID and mount it&lt;br /&gt;
** Iterate over the content of the Shard&lt;br /&gt;
*** Write an index at the beginning of the RBD image&lt;br /&gt;
*** Write the content of each object after the index&lt;br /&gt;
** Remount the RBD image readonly&lt;br /&gt;
&lt;br /&gt;
== Reading ==&lt;br /&gt;
&lt;br /&gt;
There are --ro-workers processes. Each process picks a Shard from the Read Storage at random and a workload (sequential or random).&lt;br /&gt;
&lt;br /&gt;
* Random workload&lt;br /&gt;
** Seek to a random position in the Shard&lt;br /&gt;
** Read an object&lt;br /&gt;
** The size distribution of the objects in the database mimics the distribution described in the object storage design&lt;br /&gt;
* Sequential workload&lt;br /&gt;
** Seek to a random position in the Shard&lt;br /&gt;
** Read 4MB objects sequentially&lt;br /&gt;
** Assume objects have a size of 4KB (the median object size)&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=File:3.png&amp;diff=1604</id>
		<title>File:3.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=File:3.png&amp;diff=1604"/>
		<updated>2021-08-30T10:29:08Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Latency&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
Latency&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=File:2.png&amp;diff=1603</id>
		<title>File:2.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=File:2.png&amp;diff=1603"/>
		<updated>2021-08-30T10:28:41Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Read&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
Read&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=File:1.png&amp;diff=1602</id>
		<title>File:1.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=File:1.png&amp;diff=1602"/>
		<updated>2021-08-30T10:27:25Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Write&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
Write&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1532</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1532"/>
		<updated>2021-03-15T19:03:03Z</updated>

		<summary type="html">&lt;p&gt;Dachary: s/seaweedfs/SeaweedFS/&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021 it contains 10 billions unique source code files (or “objects”, in the following) totaling ~750TB of (uncompressed) data and grows by 50TB every month. 75% of these objects have a size smaller than 16KB and 50% have a size smaller than 4KB. But these small objects only account for ~5% of the 750TB: 25% of the objects have a size &amp;amp;gt; 16KB and occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance targets for '''10PB''' and '''100 billion objects''' are as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is a significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time, not in bulk, which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design such as [https://eos-web.web.cern.ch/eos-web/ EOS] and implement something from scratch.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a fixed number of machines (the Write Storage); that number can be adjusted to control the write throughput. When a threshold is reached (e.g. 100GB) all objects are put together in a container (a Shard) and moved to a readonly storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client. It can be used to read the object back from the readonly storage. Reads scale out because the unique identifiers of the objects embed the name of the container (the Shard UUID). Writes also scale out because the Database is chosen randomly. This is the Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can rely on an API backed by an index mapping all known object signatures (the Object HASH below) to the name of the container where they can be found. Although this index prevents scaling out writes, the readonly storage can still scale out by multiplying copies of the index as needed. This is the Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two distinct storage areas:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookups require a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs. The Object HASH will be the same because it is based on the content of the Object. But the Shard in which the Object is stored may be different since it is chosen at random.&lt;br /&gt;
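A minimal sketch of such a write, assuming plain dicts stand in for the Databases (the names and the SHA-256 choice are illustrative, not the real API):&lt;br /&gt;

```python
import hashlib
import random
import uuid

# A fixed number of Databases, each bound to a random Shard UUID.
databases = {uuid.uuid4(): {} for _ in range(3)}

def write(obj: bytes):
    """Store obj in a randomly chosen Database; return its Object ID."""
    shard_uuid = random.choice(list(databases))
    obj_hash = hashlib.sha256(obj).hexdigest()
    databases[shard_uuid][obj_hash] = obj
    return (obj_hash, shard_uuid)  # Object ID = (Object HASH, Shard UUID)

id1 = write(b"same content")
id2 = write(b"same content")
```

Writing the same content twice yields the same Object HASH but possibly a different Shard UUID, exactly as described above.&lt;br /&gt;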
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
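The Database life cycle can be sketched as follows (illustrative in-memory structures; the real implementation writes the sorted objects into an RBD image):&lt;br /&gt;

```python
import uuid

databases = {}      # Write Storage: Shard UUID -> dict of objects
read_storage = {}   # Read Storage: Shard UUID -> frozen, sorted Shard

def new_database():
    """Allocate a new Shard UUID with an empty Database that accepts writes."""
    shard_uuid = uuid.uuid4()
    databases[shard_uuid] = {}
    return shard_uuid

def pack(shard_uuid):
    """Freeze a full Database into a sorted read-only Shard, then replace it."""
    db = databases.pop(shard_uuid)                        # stops accepting writes
    read_storage[shard_uuid] = tuple(sorted(db.items()))  # sorted copy
    return new_database()                                 # new Database, new Shard UUID

sid = new_database()
databases[sid]["hash-1"] = b"content"
next_sid = pack(sid)
```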
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If a Shard exists in the Read Storage, the Object HASH is used to lookup the content of the Object. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to lookup the content of the Object. If the reader is not interested in the most up to date content, it can limit its search to the Read Storage.&lt;br /&gt;
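The lookup order can be sketched as follows (dicts stand in for both storages; the shard names are hypothetical):&lt;br /&gt;

```python
# Read Storage holds packed Shards; the Write Storage still owns shard-w.
read_storage = {"shard-r": {"h1": b"packed object"}}
databases = {"shard-w": {"h2": b"fresh object"}}

def read(object_id, readonly=False):
    obj_hash, shard_uuid = object_id
    if shard_uuid in read_storage:                 # Shard already packed
        return read_storage[shard_uuid][obj_hash]
    if readonly:                                   # caller ignores recent writes
        raise KeyError(object_id)
    return databases[shard_uuid][obj_hash]         # still in the Write Storage

assert read(("h1", "shard-r")) == b"packed object"
assert read(("h2", "shard-w")) == b"fresh object"
```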
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage. It is expected that they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
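A sketch of that write path, with dicts standing in for the Databases and Global Indexes (CURRENT_SHARD and all structures are hypothetical):&lt;br /&gt;

```python
import hashlib

read_index = {}    # Read Storage Global Index: Object HASH -> Shard UUID
write_index = {}   # Write Storage Global Index: Object HASH -> Shard UUID
database = {}      # the Database currently accepting writes
CURRENT_SHARD = "shard-w1"  # hypothetical Shard UUID of that Database

def write(obj: bytes):
    """Dedup against the Read Storage Global Index, then write."""
    obj_hash = hashlib.sha256(obj).hexdigest()
    if obj_hash in read_index:                       # already packed: do nothing
        return obj_hash, read_index[obj_hash]
    database[obj_hash] = obj                         # duplicates are tolerated here
    write_index.setdefault(obj_hash, CURRENT_SHARD)  # first writer wins the race
    return obj_hash, write_index[obj_hash]

h1, s1 = write(b"payload")
h2, s2 = write(b"payload")
```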
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index: each is mapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
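The deduplication step during packing can be sketched as follows (illustrative structures and shard names):&lt;br /&gt;

```python
# "h-dup" is already packed in shard-old; "h-new" is genuinely new.
read_index = {"h-dup": "shard-old"}
write_index = {"h-dup": "shard-w1", "h-new": "shard-w1"}
database = {"h-dup": b"a", "h-new": b"b"}

def pack(shard_uuid):
    shard = {}
    for obj_hash, obj in database.items():
        if obj_hash in read_index:                 # duplicate: discard the object
            write_index[obj_hash] = read_index[obj_hash]
        else:                                      # new: add it to the Read Storage
            shard[obj_hash] = obj
            read_index[obj_hash] = shard_uuid
    return shard

shard = pack("shard-w1")
```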
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the Shard found in the Read Storage. Otherwise lookup the Object HASH from the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
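A sketch of that read path, under the same illustrative assumptions:&lt;br /&gt;

```python
# Dicts stand in for the Global Indexes, the packed Shards and the
# write Databases; all names are hypothetical.
read_index = {"h1": "shard-ro"}
write_index = {"h2": "shard-w1"}
shards = {"shard-ro": {"h1": b"packed"}}
databases = {"shard-w1": {"h2": b"fresh"}}

def read(obj_hash):
    """Resolve an Object HASH via the Read Storage Global Index first."""
    if obj_hash in read_index:
        return shards[read_index[obj_hash]][obj_hash]
    return databases[write_index[obj_hash]][obj_hash]

assert read("h1") == b"packed"
assert read("h2") == b"fresh"
```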
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending Objects one after another in a file), and since the Object HASH has a fixed size, the only per-object overhead is the 8 bytes used to store the size of the Object. Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD image, it adds R bytes. If the underlying Ceph pool is erasure coded with k=4,m=2, an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal hash table] to the Shard so that finding an object is O(1) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per Object.&lt;br /&gt;
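A toy version of such a packed Shard, with a plain dict standing in for the minimal perfect hash table (the exact on-disk format is illustrative, not the implemented one):&lt;br /&gt;

```python
import hashlib
import struct

def pack_shard(objects):
    """Serialize objects: an index of HASH -> offset, then size-prefixed payloads."""
    index, payload, base = {}, bytearray(), 0
    for obj in objects:
        index[hashlib.sha1(obj).digest()] = base
        payload += struct.pack(">Q", len(obj)) + obj   # 8-byte size prefix
        base += 8 + len(obj)
    return index, bytes(payload)

def get(index, payload, obj_hash):
    """O(1) lookup: read the offset from the index, then the size, then the bytes."""
    off = index[obj_hash]
    (size,) = struct.unpack_from(">Q", payload, off)
    return payload[off + 8 : off + 8 + size]

index, payload = pack_shard([b"small", b"object"])
```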
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, which is the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects that are smaller than a given threshold will be amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore the size of a 1KB Object will be amplified to 4KB, which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Another example is MinIO with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification] or [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift] for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes, with M = 100 billion and S = 10PB, is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 100,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
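These totals can be checked with a few lines of arithmetic; R, the per-RBD-image overhead, is left as a parameter because it is not quantified here (decimal units throughout):&lt;br /&gt;

```python
PB = 10**15
S = 10 * PB            # total payload
M = 100 * 10**9        # number of Objects

def packed_total(R=0):
    per_object = M * 24                # 24 bytes of metadata per Object = 2.4TB
    durability = S * 0.5               # erasure coding k=4,m=2 = 5PB
    images = (S // (100 * 10**9)) * R  # R bytes of overhead per 100GB RBD image
    return S + per_object + durability + images

def not_packed_total():
    return S + S * 0.25 + S * 0.5      # 25% amplification + 50% durability

assert round(packed_total() / PB, 1) == 15.0   # ~15PB before the R term
assert not_packed_total() / PB == 17.5
```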
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
For mirroring or running an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million or more), the reader can download an entire Shard instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects requires 10 million seconds, which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
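A quick arithmetic check of these figures:&lt;br /&gt;

```python
# Decimal units; figures taken from the text above.
objects = 10**9             # 1 billion objects, 1/10 of the archive
lookup_s = 0.010            # 10 ms per individual lookup
days_lookup = objects * lookup_s / 86400

size = 75 * 10**12          # ~75TB for those objects
rate = 100 * 10**6          # Shards read at 100MB/s
days_bulk = size / rate / 86400

assert days_lookup > 100    # over 100 days
assert days_bulk < 10       # less than 10 days
```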
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more], and most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained that either included the following features or allowed a third party to contribute them back to the project:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Have a sound approach to addressing '''security''' problems (CVEs etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! SeaweedFS&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Average&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases and testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and the number of objects in Software Heritage are similar to what is described in the 2010 article “[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook’s photo storage]”, which motivated the implementation of [https://github.com/chrislusf/seaweedfs SeaweedFS] in 2013 and of [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects.&lt;br /&gt;
&lt;br /&gt;
Contributing to SeaweedFS or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide to deploy at scale&lt;br /&gt;
* Discussing with upstream how to create stable releases, define their lifecycle and organize release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up an infrastructure and creating the software for integration testing, to be run before a stable release is published, to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discuss with upstream to validate the blueprint&lt;br /&gt;
* Implement the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Does not scale ==&lt;br /&gt;
&lt;br /&gt;
[https://eos-web.web.cern.ch/eos-web/ EOS] is based on Ceph and architected for packing large objects in [https://docs.ceph.com/en/latest/rbd/ RBD]. However, it is not designed to scale beyond a few billion objects. Contrary to Ambry, Swift and other similar solutions, it delegates storage to Ceph, which makes it easier to modify and release without risking data loss or corruption. Instead of modifying EOS to scale to 100 billion objects, it is more practical to:&lt;br /&gt;
&lt;br /&gt;
* Write an EOS alternative from scratch, using the same ideas and adding the desired scalability&lt;br /&gt;
* Package&lt;br /&gt;
* Document&lt;br /&gt;
* Test&lt;br /&gt;
* Publish stable releases&lt;br /&gt;
* Define a security policy&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the effort each requires should be compared. This is however difficult because the nature of the effort differs from one solution to the next. The following factors were considered and aggregated in a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the work done implies the risk of losing data, it makes the work significantly more complicated. It is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not have integration testing to verify that upgrading to a newer version won’t lead to a regression, which is the case with Ambry. The Ambry upstream likely has extensive integration tests, but they are not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment are familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! SeaweedFS&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Python&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing an alternative to EOS from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1517</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1517"/>
		<updated>2021-03-10T21:12:23Z</updated>

		<summary type="html">&lt;p&gt;Dachary: formatting&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021 it contains 10 billions unique source code files (or “objects”, in the following) totaling ~750TB of (uncompressed) data and grows by 50TB every month. 75% of these objects have a size smaller than 16KB and 50% have a size smaller than 4KB. But these small objects only account for ~5% of the 750TB: 25% of the objects have a size &amp;amp;gt; 16KB and occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance targets for '''10PB''' and '''100 billion objects''' are as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is a significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time, not in bulk, which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design such as [https://eos-web.web.cern.ch/eos-web/ EOS] and implement something from scratch.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a fixed number of machines (the Write Storage); that number can be adjusted to control the write throughput. When a threshold is reached (e.g. 100GB) all objects are put together in a container (a Shard) and moved to a readonly storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client. It can be used to read the object back from the readonly storage. Reads scale out because the unique identifiers of the objects embed the name of the container (the Shard UUID). Writes also scale out because the Database is chosen randomly. This is the Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can rely on an API backed by an index mapping all known object signatures (the Object HASH below) to the name of the container where they can be found. Although this index prevents scaling out writes, the readonly storage can still scale out by multiplying copies of the index as needed. This is Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write Storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
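&lt;br /&gt;
The glossary entries can be made concrete with a small sketch. This is purely illustrative (the names ObjectId, object_hash and shard_uuid are hypothetical, not the actual swh-objstorage API):&lt;br /&gt;
&lt;br /&gt;
```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectId:
    """An Object ID pairs the Object HASH with the Shard UUID storing it."""
    object_hash: bytes  # e.g. the 20-byte checksum part of a SWHID
    shard_uuid: uuid.UUID

# The Shard UUID embedded in the Object ID lets a reader locate the
# right Shard directly, without consulting any global index.
oid = ObjectId(
    object_hash=bytes.fromhex("94a9ed024d3859793618152ea559a168bbcbb5e2"),
    shard_uuid=uuid.uuid4(),
)
```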
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two kinds of storage:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookups require a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs. The Object HASH will be the same because it is based on the content of the Object. But the Shard in which the Object is stored may be different since it is chosen at random.&lt;br /&gt;
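&lt;br /&gt;
The write path described above can be sketched as follows; the names (databases, write) are hypothetical, not the actual implementation:&lt;br /&gt;
&lt;br /&gt;
```python
import random
import uuid

# Each write Database is bound to one Shard UUID and maps hashes to content.
databases = {uuid.uuid4(): {} for _ in range(3)}

def write(object_hash, content):
    # The Database, and therefore the Shard, is chosen at random:
    # this is what lets writes scale out.
    shard_uuid = random.choice(list(databases))
    databases[shard_uuid][object_hash] = content
    return (object_hash, shard_uuid)  # the Object ID returned to the client

oid_a = write(b"hash-1", b"content")
oid_b = write(b"hash-1", b"content")  # may land in a different Shard
```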
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
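&lt;br /&gt;
A hypothetical sketch of this packing step, with a tiny threshold so it can run in a few lines (the real threshold would be on the order of 100GB):&lt;br /&gt;
&lt;br /&gt;
```python
import uuid

THRESHOLD = 32  # bytes; illustrative stand-in for e.g. 100GB

read_storage = {}  # Shard UUID -> sorted mapping of hash -> content

def maybe_pack(shard_uuid, database):
    size = sum(len(content) for content in database.values())
    if size < THRESHOLD:
        return shard_uuid, database  # keep accepting writes
    # Sort by hash so the Shard can later carry a compact lookup table,
    # then drop the Database and start a fresh one with a new Shard UUID.
    read_storage[shard_uuid] = dict(sorted(database.items()))
    return uuid.uuid4(), {}

db = {b"b": b"0123456789abcdef", b"a": b"0123456789abcdef"}
new_uuid, new_db = maybe_pack(uuid.uuid4(), db)
```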
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If the Shard exists in the Read Storage, the Object HASH is used to look up the content of the Object. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content of the Object. If the reader is not interested in the most up to date content, it can limit its search to the Read Storage.&lt;br /&gt;
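&lt;br /&gt;
The read path can be sketched as a fallback from the Read Storage to the Write Storage; again a hypothetical illustration, not the actual code:&lt;br /&gt;
&lt;br /&gt;
```python
def read(object_id, read_storage, write_storage):
    object_hash, shard_uuid = object_id
    shard = read_storage.get(shard_uuid)
    if shard is not None:
        return shard[object_hash]
    # Not packed yet: the Shard UUID still designates a write Database.
    return write_storage[shard_uuid][object_hash]

read_storage = {"shard-1": {b"h1": b"old object"}}
write_storage = {"shard-2": {b"h2": b"new object"}}
```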
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage. It is expected that they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
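&lt;br /&gt;
A minimal sketch of this deduplicating write path, with hypothetical names (layer1_write, do_write):&lt;br /&gt;
&lt;br /&gt;
```python
def layer1_write(object_hash, content, read_index, write_index, do_write):
    if object_hash in read_index:
        return (object_hash, read_index[object_hash])  # already archived
    shard_uuid = do_write(object_hash, content)
    # Concurrent writers of identical content race to insert here;
    # duplicates are reconciled later, during packing.
    write_index[object_hash] = shard_uuid
    return (object_hash, shard_uuid)

read_index = {b"known": "shard-0"}
write_index = {}
oid = layer1_write(b"known", b"payload", read_index, write_index,
                   lambda h, c: "shard-9")
```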
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index: each one is mapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
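&lt;br /&gt;
The deduplication performed during packing can be sketched as follows (a hypothetical illustration using plain dicts for both indexes):&lt;br /&gt;
&lt;br /&gt;
```python
def pack_with_dedup(shard_uuid, objects, read_index, write_index):
    packed = {}
    for object_hash, content in objects.items():
        if object_hash in read_index:
            # Duplicate: discard it and repoint the Write Storage index
            # at the Shard that already holds this Object.
            write_index[object_hash] = read_index[object_hash]
        else:
            read_index[object_hash] = shard_uuid
            packed[object_hash] = content
    return packed

read_index = {b"dup": "shard-0"}
write_index = {b"dup": "shard-1", b"new": "shard-1"}
packed = pack_with_dedup("shard-1", {b"dup": b"x", b"new": b"y"},
                         read_index, write_index)
```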
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the corresponding Shard in the Read Storage. Otherwise look up the Object HASH in the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending each Object one after another in a file), and since the Object HASH has a fixed size, the only per-object overhead is the 8 bytes needed to store the size of the Object. Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD Image, the image itself adds a fixed overhead of R bytes. If the underlying Ceph pool is erasure coded k=4,m=2 an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal hash table] to the Shard so that finding an object is O(1) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per object.&lt;br /&gt;
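&lt;br /&gt;
This packed layout can be sketched as follows: each Object is prefixed by its 8-byte size, and a side index stores an 8-byte offset per Object. A real implementation would use a minimal perfect hash table over the Object HASHes; the plain dict here only stands in for it:&lt;br /&gt;
&lt;br /&gt;
```python
import struct

def pack_shard(objects):
    data = bytearray()
    offsets = {}
    for object_hash, content in sorted(objects.items()):
        offsets[object_hash] = len(data)          # 8-byte offset per Object
        data += struct.pack("!Q", len(content))   # 8-byte size per Object
        data += content
    return bytes(data), offsets

def read_object(data, offsets, object_hash):
    offset = offsets[object_hash]  # constant-time lookup thanks to the index
    (size,) = struct.unpack_from("!Q", data, offset)
    return data[offset + 8 : offset + 8 + size]

data, offsets = pack_shard({b"h2": b"second", b"h1": b"first"})
```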
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, which is the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects that are smaller than a given threshold will be amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore a 1KB Object will be amplified to 4KB, which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Another example is MinIO with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification] or [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift] for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes, with M = 100 billion and S = 10PB, is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 100,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
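&lt;br /&gt;
The arithmetic above can be reproduced directly. R, the fixed per-RBD-image overhead, is left symbolic in the text and ignored here, so the packed total comes out slightly under the quoted ~15.5PB:&lt;br /&gt;
&lt;br /&gt;
```python
PB = 10**15
TB = 10**12
GB = 10**9

S = 10 * PB        # total payload
M = 100 * 10**9    # number of Objects

shards = S // (100 * GB)   # number of 100GB Shards (the R term applies per Shard)
per_object = M * 24        # 24 bytes of metadata per Object
durability = S // 2        # +50% for erasure coding k=4,m=2
packed = S + per_object + durability

amplification = S // 4     # +25% small-object space amplification
not_packed = S + amplification + durability
```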
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
For mirroring or running an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million or more), the reader can download an entire Shard instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects requires 10 million seconds which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
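&lt;br /&gt;
A back-of-the-envelope check of these figures:&lt;br /&gt;
&lt;br /&gt;
```python
objects = 10**9            # one billion Objects
lookup_s = 0.010           # 10 milliseconds per individual lookup
individual_s = objects * lookup_s          # seconds for one-by-one reads

shard_bytes = 75 * 10**12  # ~75TB, 1/10 of the archive
throughput = 100 * 10**6   # Shards read at 100MB/s
bulk_s = shard_bytes / throughput          # seconds for bulk reads

def days(seconds):
    return seconds / 86_400
```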
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more]. Most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained, either already providing the following features or open to a third party contributing them back to the project:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Have a sound approach to addressing '''security''' problems (CVEs etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Average&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases and testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and the number of objects in Software Heritage are similar to what is described in the 2010 article “[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook’s photo storage]”, which motivated the implementation of [https://github.com/chrislusf/seaweedfs Seaweedfs] in 2013 and of [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects.&lt;br /&gt;
&lt;br /&gt;
Contributing to Seaweedfs or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide to deploy at scale&lt;br /&gt;
* Discussing with upstream to create stable releases, define their lifecycle and organize release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up an infrastructure and creating the software for integration testing, to be run before a stable release is published to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discussing with upstream to validate the blueprint&lt;br /&gt;
* Implementing the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Does not scale ==&lt;br /&gt;
&lt;br /&gt;
[https://eos-web.web.cern.ch/eos-web/ EOS] is based on Ceph and architected for packing large objects in [https://docs.ceph.com/en/latest/rbd/ RBD]. However, it is not designed to scale beyond a few billion objects. Contrary to Ambry, Swift and other similar solutions, it delegates storage to Ceph, which makes it easier to modify and release without risking data loss or corruption. Instead of modifying EOS to scale to 100 billion objects, it is more practical to:&lt;br /&gt;
&lt;br /&gt;
* Write an EOS alternative from scratch, using the same ideas and adding the desired scalability&lt;br /&gt;
* Package&lt;br /&gt;
* Document&lt;br /&gt;
* Test&lt;br /&gt;
* Publish stable releases&lt;br /&gt;
* Define a security policy&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the effort required should be compared. The comparison is however difficult because the nature of the effort differs. The following factors were considered and aggregated in a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the contributed work could cause data loss, the work becomes significantly more complicated. This is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not have integration testing to verify that upgrading to a newer version won’t lead to a regression, which is the case with Ambry. It is likely that the Ambry upstream has extensive integration testing, but it is not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment are familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Python&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing an alternative to EOS from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1516</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1516"/>
		<updated>2021-03-10T14:55:12Z</updated>

		<summary type="html">&lt;p&gt;Dachary: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project’s mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021 it contains 10 billion unique source code files (or “objects”, in the following) totaling ~750TB of (uncompressed) data and grows by 50TB every month. 75% of these objects have a size smaller than 16KB and 50% have a size smaller than 4KB. But these small objects only account for ~5% of the 750TB: 25% of the objects have a size &amp;amp;gt; 16KB and occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance for '''10PB''' and '''100 billion objects''' is as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is a significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time and not in bulk, which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design such as [https://eos-web.web.cern.ch/eos-web/ EOS] and implement something from scratch.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a fixed number of machines (the Write Storage); that number can be adjusted to control the write throughput. When a threshold is reached (e.g. 100GB) all objects are put together in a container (a Shard), and moved to a readonly storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client. It can be used to read the object back from the readonly storage. Reads scale out because the unique identifiers of the objects embed the name of the container (the Shard UUID). Writes also scale out because the Database is chosen randomly. This is Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can rely on an API backed by an index mapping all known object signatures (the Object HASH below) to the name of the container where they can be found. Although this index prevents scaling out writes, the readonly storage can still scale out by multiplying copies of the index as needed. This is Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write Storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two different storages:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookups require a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A Database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs. The Object HASH will be the same because it is based on the content of the Object. But the Shard in which the Object is stored may be different since it is chosen at random.&lt;br /&gt;
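The write path just described can be sketched in Python. The class, field names and in-memory dicts below are hypothetical stand-ins, not the actual swh-objstorage winery code:

```python
import random
import uuid

class WriteStorage:
    """Minimal sketch of Layer 0 writes (hypothetical API)."""

    def __init__(self, database_count=4):
        # Each Database is bound to a randomly chosen Shard UUID; every
        # Object written to it ends up in that Shard when packed.
        self.databases = [
            {"shard_uuid": uuid.uuid4(), "objects": {}}
            for _ in range(database_count)
        ]

    def write(self, object_hash, content):
        # The Database is chosen at random, so writing the same Object
        # twice may yield different Object IDs (same HASH, different Shard).
        db = random.choice(self.databases)
        db["objects"][object_hash] = content
        # A successful write returns the Object ID: (HASH, Shard UUID).
        return (object_hash, db["shard_uuid"])
```

Because the Database (and hence the Shard) is picked at random, writes scale out across Databases without coordination, which is the point made above.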
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If a Shard exists in the Read Storage, the Object HASH is used to look up the content of the Object. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content of the Object. If the reader is not interested in the most up to date content, it can limit its search to the Read Storage.&lt;br /&gt;
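A minimal sketch of this read path, where plain dicts stand in for the Shards of the Read Storage and the Databases of the Write Storage (hypothetical structures, not the real implementation):

```python
def read_object(object_id, read_storage, write_storage):
    """Sketch of a Layer 0 read.

    read_storage:  {shard_uuid: {object_hash: content}}  (packed Shards)
    write_storage: {shard_uuid: {object_hash: content}}  (Databases keyed
                   by the Shard UUID they own)
    """
    object_hash, shard_uuid = object_id
    # If the Shard exists in the Read Storage, look the Object up there...
    if shard_uuid in read_storage:
        return read_storage[shard_uuid][object_hash]
    # ...otherwise fall back to the Database that owns that Shard UUID.
    return write_storage[shard_uuid][object_hash]
```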
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage. It is expected that they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
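The deduplicating write can be sketched as follows; the dict-based indexes and the function name are hypothetical illustrations of the rule stated above:

```python
def layer1_write(object_hash, content, read_index, write_index, write_db):
    """Sketch of a Layer 1 write with deduplication (hypothetical
    structures: read_index / write_index map Object HASH to Shard UUID)."""
    if object_hash in read_index:
        return  # already packed in the Read Storage: do nothing
    # Perform the write, then record the Object ID in the Write Storage
    # Global Index; concurrent duplicates race on this insert.
    write_db["objects"][object_hash] = content
    write_index.setdefault(object_hash, write_db["shard_uuid"])
```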
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index. The Object HASH is mapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the Shard found in the Read Storage. Otherwise look up the Object HASH from the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
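The same lookup rule, sketched with hypothetical dict-based indexes and storages (not the real implementation):

```python
def layer1_read(object_hash, read_index, write_index,
                read_storage, write_storage):
    """Sketch of a Layer 1 read by Object HASH alone.

    The indexes map Object HASH to Shard UUID; the storages map
    Shard UUID to {Object HASH: content}.
    """
    if object_hash in read_index:
        # Fast path: the Object was already packed into a Shard.
        return read_storage[read_index[object_hash]][object_hash]
    # Otherwise the Object is still in a Database of the Write Storage.
    return write_storage[write_index[object_hash]][object_hash]
```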
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending each Object after another in a file) and since the Object HASH has a fixed size, the only per-object overhead is the field storing the size of the Object (8 bytes). Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD Image, it adds R bytes. If the underlying Ceph pool is erasure coded k=4,m=2 an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal hash table] to the Shard so that finding an object is O(1) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per object.&lt;br /&gt;
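The packed layout with O(1) lookup can be sketched as follows. SHA-256 and the plain Python dict are stand-ins: the real design uses the SWHID checksum as the Object HASH and a minimal perfect hash table instead of a dict:

```python
import hashlib

def pack_shard(objects):
    """Sketch of a packed Shard: size-prefixed records plus an offset
    index. Per object: an 8-byte size field in the data area and an
    8-byte offset in the index, 16 bytes of overhead in total."""
    data = bytearray()
    index = {}  # Object HASH to offset of the size-prefixed record
    for content in objects:
        object_hash = hashlib.sha256(content).digest()
        index[object_hash] = len(data)
        data += len(content).to_bytes(8, "little")  # 8-byte size field
        data += content
    return bytes(data), index

def read_from_shard(data, index, object_hash):
    # O(1): one index probe, then one read at the recorded offset.
    offset = index[object_hash]
    size = int.from_bytes(data[offset:offset + 8], "little")
    return data[offset + 8:offset + 8 + size]
```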
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, which is the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects that are smaller than a given threshold will be amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore the size of a 1KB Object will be amplified to 4KB, which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Another example is MinIO with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification] or [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift], for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes with M=100 billion and S=10PB is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 100,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
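These figures can be checked mechanically. The sketch below leaves R (the per-RBD-image overhead) as a parameter, since the text does not fix its value:

```python
PB = 10**15
TB = 10**12
GB = 10**9

M = 100 * 10**9   # 100 billion Objects
S = 10 * PB       # total payload

def packed_total(rbd_overhead_per_image):
    """Known terms of the 'packed' estimate; R is a free parameter."""
    shards = S // (100 * GB)       # number of 100GB Shards
    index = M * 24                 # 24 bytes per Object = 2.4TB
    durability = S // 2            # +50% for erasure coding k=4,m=2
    return S + shards * rbd_overhead_per_image + index + durability

def not_packed_total():
    amplification = S // 4         # +25% space amplification (optimistic)
    durability = S // 2            # +50% for durability
    return S + amplification + durability
```

Even before adding the R term, the packed total is about 15PB against 17.5PB unpacked.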
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
For mirroring or running an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million or more), the reader can download an entire Shard instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects requires 10 million seconds, which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
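A quick arithmetic check of these two figures:

```python
SECONDS_PER_DAY = 86_400

objects = 10**9
lookup_s = 0.01                   # 10 milliseconds per individual lookup
individual_days = objects * lookup_s / SECONDS_PER_DAY

shard_bytes = 75 * 10**12         # ~75TB of packed Shards
throughput = 100 * 10**6          # 100MB/s sequential Shard reads
bulk_days = shard_bytes / throughput / SECONDS_PER_DAY
```

Individual lookups come to roughly 116 days, bulk Shard reads to roughly 8.7 days.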
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more], and most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained, either already including the following features or with the possibility for a third party to contribute them back to the project:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Have a sound approach to addressing '''security''' problems (CVEs etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Average&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases or testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and the number of objects in Software Heritage are similar to what is described in the 2010 article “[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook’s photo storage]”, which motivated the implementation of [https://github.com/chrislusf/seaweedfs Seaweedfs] in 2013 and of [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects.&lt;br /&gt;
&lt;br /&gt;
Contributing to Seaweedfs or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide to deploy at scale&lt;br /&gt;
* Discussing with upstream to create stable releases, define their lifecycle and organize release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up an infrastructure and creating the software for integration testing to be run before a stable release is published, to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discussing with upstream to validate the blueprint&lt;br /&gt;
* Implementing the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Does not scale ==&lt;br /&gt;
&lt;br /&gt;
[https://eos-web.web.cern.ch/eos-web/ EOS] is based on Ceph and architected for packing large objects in [https://docs.ceph.com/en/latest/rbd/ RBD]. However, it is not designed to scale over a few billion objects. Contrary to Ambry, Swift and other similar solutions, it delegates storage to Ceph, which makes it easier to modify and release without risking data loss or corruption. Instead of modifying EOS to scale to 100 billion objects, it is more practical to:&lt;br /&gt;
&lt;br /&gt;
* Write an EOS alternative from scratch, using the same ideas and adding the desired scalability&lt;br /&gt;
* Package&lt;br /&gt;
* Document&lt;br /&gt;
* Test&lt;br /&gt;
* Publish stable releases&lt;br /&gt;
* Define a security policy&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the effort required should be compared. This is difficult, however, because the nature of the effort differs. The following factors were considered and aggregated in a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the work done implies the risk of losing data, it makes the work significantly more complicated. It is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not have integration testing to verify that upgrading to a newer version won’t lead to a regression, which is the case with Ambry. It is likely that the Ambry upstream has extensive integration testing, but it is not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment is familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Python&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing an alternative to EOS from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1515</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1515"/>
		<updated>2021-03-10T13:13:46Z</updated>

		<summary type="html">&lt;p&gt;Dachary: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= A practical approach to efficiently store 100 billions small objects in Ceph =&lt;br /&gt;
&lt;br /&gt;
The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project's mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021 it contains 10 billion unique source code files (or “objects”, in the following) totaling ~750TB of (uncompressed) data and grows by 50TB every month. 75% of these objects have a size smaller than 16KB and 50% have a size smaller than 4KB. But these small objects only account for ~5% of the 750TB: 25% of the objects have a size &amp;amp;gt; 16KB and occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance targets for '''10PB''' and '''100 billion objects''' are as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is a significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time and not in bulk which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design such as [https://eos-web.web.cern.ch/eos-web/ EOS] and implement something from scratch.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a fixed number of machines (the Write Storage), sized to match the desired write throughput. When a threshold is reached (e.g. 100GB), all objects are packed together in a container (a Shard) and moved to a readonly storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client. It can be used to read the object back from the readonly storage. Reads scale out because the unique identifier of an object embeds the name of the container (the Shard UUID). Writes also scale out because the Database is chosen randomly. This is Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can rely on an API backed by an index mapping all known object signatures (the Object HASH below) to the name of the container where they can be found. Although this index prevents scaling out writes, the readonly storage can still scale out by multiplying copies of the index as needed. This is Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write Storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two different storages:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookups require a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A Database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs. The Object HASH will be the same because it is based on the content of the Object. But the Shard in which the Object is stored may be different since it is chosen at random.&lt;br /&gt;
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If a Shard exists in the Read Storage, the Object HASH is used to look up the content of the Object. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content of the Object. If the reader is not interested in the most up to date content, it can limit its search to the Read Storage.&lt;br /&gt;
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage. It is expected that they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index: each one is remapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
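The deduplicating pack step above can be sketched like this (a hypothetical helper; dicts stand in for the Global Indexes):

```python
def pack_with_dedup(shard_uuid, objects, read_index, write_index):
    """Layer 1 packing sketch: Objects already known to the Read Storage
    Global Index are discarded and remapped; the rest join the new Shard."""
    shard = {}
    for obj_hash, content in sorted(objects.items()):
        if obj_hash in read_index:
            # Duplicate: point the Write Storage Global Index at the
            # Shard that already holds this Object.
            write_index[obj_hash] = read_index[obj_hash]
        else:
            read_index[obj_hash] = shard_uuid
            shard[obj_hash] = content
    return shard
```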
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the Shard found in the Read Storage. Otherwise lookup the Object HASH from the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending Objects one after another in a file), and since the Object HASH has a fixed size, the only per-Object overhead is the 8 bytes needed to store the size of the Object. Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD Image, the image itself adds R bytes of overhead. If the underlying Ceph pool is erasure coded with k=4,m=2, an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal hash table] to the Shard so that finding an Object is O(1) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per Object.&lt;br /&gt;
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects smaller than a given threshold is amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore a 1KB Object is amplified to 4KB, which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Other examples are MinIO, with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification], and [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift], for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes, with M = 100 billion and S = 10PB, is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 10,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
To mirror the archive or run an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million or more per Shard), the reader can download entire Shards instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects one at a time requires 10 million seconds, which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
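The arithmetic behind these two figures, as a quick check (helper names are illustrative):

```python
DAY = 24 * 3600  # seconds per day

def individual_lookup_days(n_objects, lookup_s=0.010):
    """Time to fetch objects one by one at 10ms per lookup."""
    return n_objects * lookup_s / DAY

def bulk_read_days(total_bytes, throughput=100 * 10**6):
    """Time to stream the same data as packed Shards at 100MB/s."""
    return total_bytes / throughput / DAY

# 1 billion individual lookups: ~115 days; ~75TB of Shards: ~9 days.
```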
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions, all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more], and most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained that either include the following features or could have them contributed back to the project by a third party:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Follow a sound approach to address '''security''' problems (CVEs etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Average&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases and testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and number of objects in Software Heritage are similar to what is described in the 2010 article “[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook’s photo storage]”, which motivated the implementation of [https://github.com/chrislusf/seaweedfs Seaweedfs] in 2013 and of [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects.&lt;br /&gt;
&lt;br /&gt;
Contributing to Seaweedfs or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide for deploying at scale&lt;br /&gt;
* Discussing with upstream to create stable releases, define their lifecycle and organize release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up an infrastructure and creating the software for integration testing to be run before a stable release is published, to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discussing with upstream to validate the blueprint&lt;br /&gt;
* Implementing the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Does not scale ==&lt;br /&gt;
&lt;br /&gt;
[https://eos-web.web.cern.ch/eos-web/ EOS] is based on Ceph and architected to pack large objects in [https://docs.ceph.com/en/latest/rbd/ RBD]. However, it is not designed to scale beyond a few billion objects. Contrary to Ambry, Swift and other similar solutions, it delegates storage to Ceph, which makes it easier to modify and release without risking data loss or corruption. Instead of modifying EOS to scale to 100 billion objects, it is more practical to:&lt;br /&gt;
&lt;br /&gt;
* Write an EOS alternative from scratch, using the same ideas and adding the desired scalability&lt;br /&gt;
* Package&lt;br /&gt;
* Document&lt;br /&gt;
* Test&lt;br /&gt;
* Publish stable releases&lt;br /&gt;
* Define a security policy&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the effort it requires should be compared. It is however difficult because the nature of the effort is different. The following factors were considered and aggregated in a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the work done implies the risk of losing data, it makes the work significantly more complicated. This is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not publish integration tests verifying that upgrading to a newer version won’t lead to a regression, which is the case with Ambry: the Ambry upstream likely has extensive integration tests, but they are not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment is familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Python&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing an alternative to EOS from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1514</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1514"/>
		<updated>2021-03-10T13:07:52Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Created page with &amp;quot;= A practical approach to efficiently store 100 billions small objects in Ceph =  The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project mission is to...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= A practical approach to efficiently store 100 billions small objects in Ceph =&lt;br /&gt;
&lt;br /&gt;
The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project's mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021 it contains 10 billion unique source code files (or “objects” in the following) totaling ~750TB of (uncompressed) data and grows by 50TB every month. 75% of these objects are smaller than 16KB and 50% are smaller than 4KB, but these small objects only account for ~5% of the 750TB: the 25% of objects larger than 16KB occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance for '''10PB''' and '''100 billion objects''' is as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time, not in bulk, which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design such as [https://eos-web.web.cern.ch/eos-web/ EOS] and implement something from scratch.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a number of machines (the Write Storage) that can be adjusted to control the write throughput. When a threshold is reached (e.g. 100GB), all objects are packed together in a container (a Shard) and moved to a readonly storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client; it can be used to read the object back from the readonly storage. Reads scale out because the unique identifier of an object embeds the name of the container (the Shard UUID). Writes also scale out because the Database is chosen at random. This is Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can rely on an API backed by an index mapping the signatures of all known objects (the Object HASH below) to the name of the container where they can be found. Although this index prevents writes from scaling out, the readonly storage can still scale out by multiplying copies of the index as needed. This is Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write Storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
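As a minimal illustration of the glossary, the Object ID can be modeled as an immutable pair (the class and field names are hypothetical):

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectId:
    """Object ID: the Object HASH plus the Shard UUID, so the Shard
    holding the content can be derived from the identifier alone."""
    object_hash: bytes     # e.g. the checksum part of a SWHID
    shard_uuid: uuid.UUID  # the Shard that contains the Object
```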
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two distinct storages:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookups require a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs: the Object HASH will be the same because it is derived from the content of the Object, but the Shard in which the Object is stored may differ since the Database is chosen at random.&lt;br /&gt;
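A sketch of this write path, with dicts standing in for the real Databases (names are hypothetical):

```python
import random

def layer0_write(obj_hash, content, databases):
    """Layer 0 write: the Database (hence the Shard UUID) is chosen at
    random, so writing the same content twice may return two different
    Object IDs that share the same Object HASH."""
    shard_uuid = random.choice(sorted(databases))
    databases[shard_uuid][obj_hash] = content
    return (shard_uuid, obj_hash)  # the Object ID returned to the client
```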
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If the corresponding Shard exists in the Read Storage, the Object HASH is used to look up the content of the Object there. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content of the Object. If the reader does not need the most up-to-date content, it can limit its search to the Read Storage.&lt;br /&gt;
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage. It is expected that they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index: each one is remapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the Shard found in the Read Storage. Otherwise lookup the Object HASH from the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending Objects one after another in a file), and since the Object HASH has a fixed size, the only per-Object overhead is the 8 bytes needed to store the size of the Object. Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD Image, the image itself adds R bytes of overhead. If the underlying Ceph pool is erasure coded with k=4,m=2, an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal hash table] to the Shard so that finding an Object is O(1) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per Object.&lt;br /&gt;
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects smaller than a given threshold is amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore a 1KB Object is amplified to 4KB, which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Other examples are MinIO, with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification], and [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift], for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes, with M = 100 billion and S = 10PB, is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 10,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
To mirror the archive or run an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million or more per Shard), the reader can download entire Shards instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects one at a time requires 10 million seconds, which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions, all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more], and most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained, either already including the following features or open to a third party contributing them back to the project:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Have a sound approach to address '''security''' problems (CVEs etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Average&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases and testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and number of objects in Software Heritage are similar to what is described in the 2010 article “[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook’s photo storage]” that motivated the implementation of [https://github.com/chrislusf/seaweedfs Seaweedfs] in 2013 and of [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects.&lt;br /&gt;
&lt;br /&gt;
Contributing to Seaweedfs or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide to deploy at scale&lt;br /&gt;
* Discussing with upstream how to create stable releases, define their lifecycle and organize release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up an infrastructure and creating the software for integration testing, to be run before a stable release is published, to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discussing with upstream to validate the blueprint&lt;br /&gt;
* Implementing the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Does not scale ==&lt;br /&gt;
&lt;br /&gt;
[https://eos-web.web.cern.ch/eos-web/ EOS] is based on Ceph and architected to pack large objects in [https://docs.ceph.com/en/latest/rbd/ RBD]. However, it is not designed to scale beyond a few billion objects. Contrary to Ambry, Swift and other similar solutions, it delegates storage to Ceph, which makes it easier to modify and release without risking data loss or corruption. Instead of modifying EOS to scale to 100 billion objects, it is more practical to:&lt;br /&gt;
&lt;br /&gt;
* Write an EOS alternative from scratch, using the same ideas and adding the desired scalability&lt;br /&gt;
* Package&lt;br /&gt;
* Document&lt;br /&gt;
* Test&lt;br /&gt;
* Publish stable releases&lt;br /&gt;
* Define a security policy&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the required effort should be compared. This is difficult, however, because the nature of the effort differs between solutions. The following factors were considered and aggregated into a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the work done implies the risk of losing data, it makes the work significantly more complicated. It is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not have integration testing to verify that upgrading to a newer version won’t lead to a regression, which is the case with Ambry. The Ambry upstream likely runs extensive integration tests, but they are not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment are familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! EOS&lt;br /&gt;
! Seaweedfs&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Python&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing an alternative to EOS from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Arcanist_setup&amp;diff=1398</id>
		<title>Arcanist setup</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Arcanist_setup&amp;diff=1398"/>
		<updated>2021-01-08T22:37:10Z</updated>

		<summary type="html">&lt;p&gt;Dachary: typo&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://secure.phabricator.com/book/phabricator/article/arcanist/ Arcanist] is a command line interface to [[Phabricator]].&lt;br /&gt;
This page details how to configure it for use with the [[Software Heritage]] forge.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apt-get install arcanist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Configuration ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ arc set-config default https://forge.softwareheritage.org/&lt;br /&gt;
$ arc install-certificate&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
arc will prompt you to log in to Phabricator via the web (which will ask for your personal Phabricator credentials).&lt;br /&gt;
You will then have to copy and paste the API token from the web page to arc, and hit Enter to complete the certificate installation.&lt;br /&gt;
&lt;br /&gt;
'''All done!''' Now have a look at ''arc help'' and start hacking.&lt;br /&gt;
&lt;br /&gt;
=== Configuration file ===&lt;br /&gt;
&lt;br /&gt;
Arcanist configuration is stored in ''~/.arcrc''.&lt;br /&gt;
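For reference, the stored configuration looks roughly like the following (an illustrative sketch, not the exact file written on your machine; the token value is a placeholder):

```json
{
  "config": {
    "default": "https://forge.softwareheritage.org/"
  },
  "hosts": {
    "https://forge.softwareheritage.org/api/": {
      "token": "cli-XXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    }
  }
}
```

The ''config'' section holds values set with ''arc set-config''; the ''hosts'' section holds the API token installed by ''arc install-certificate''.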
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://www.mediawiki.org/wiki/Phabricator/Arcanist#Setup Wikimedia guide to Arcanist setup]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=418</id>
		<title>Suggestion box: source code to add</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=418"/>
		<updated>2016-08-09T09:51:39Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Draft method to harvest wikidata&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Archive]] is growing organically. We started &amp;quot;small&amp;quot;, tracking 3 '''software origins''' (GitHub + Debian + GNU), and we will be adding new origins bit by bit, depending on the urgency of archiving them and available development energies to integrate them into Software Heritage.&lt;br /&gt;
&lt;br /&gt;
Using this page you can suggest software origins that we aren't following yet, but should be. You can include information about who to contact for technical collaboration, the urgency of archival, and other useful information. To that end, just add a row to the table below. Here is some information about the meaning of the various columns.&lt;br /&gt;
&lt;br /&gt;
Entries are currently listed simply in order of addition to this page; we will add more structure when the list starts growing.&lt;br /&gt;
&lt;br /&gt;
=== Legend ===&lt;br /&gt;
&lt;br /&gt;
;Software origin&lt;br /&gt;
: any (publicly accessible) &amp;quot;place&amp;quot; on the Internet that hosts software in source code form. Please provide a title for it and hyperlink it to the relevant URL&lt;br /&gt;
;Type of origin&lt;br /&gt;
: information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, a homepage publishing tarballs, or a one-shot source code repository. For all kinds of repositories please specify which VCS system is in use (Git, SVN, CVS, etc.)&lt;br /&gt;
;Contact&lt;br /&gt;
: who to contact for technical collaboration on how to best archive source code hosted on the software origin. You can list yourself if you're the relevant person, or provide the most relevant contact point if you know it&lt;br /&gt;
;Conservation status&lt;br /&gt;
: information about how likely it is that the software origin will disappear; high likelihood will make it more urgent for us to archive software hosted there. We suggest to use the [https://en.wikipedia.org/wiki/Conservation_status species conservation status], i.e., one of: Critically endangered (CR), Endangered (EN), Vulnerable (VU), Near threatened (NT), Least concern (LC).&lt;br /&gt;
;How to mirror&lt;br /&gt;
: (pointers to) technical information on how to do a full mirror of ''all'' the source code available at the software origin, ideally one shot and in batch&lt;br /&gt;
;How to keep up&lt;br /&gt;
: (pointers to) technical information on how to incrementally retrieve new source code accumulated since the last visit; usually this should be based on some kind of incremental change feed or event API&lt;br /&gt;
;Notes&lt;br /&gt;
: anything else you think we should know about this software origin&lt;br /&gt;
&lt;br /&gt;
== Suggestions ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Software origin&lt;br /&gt;
!Type of origin&lt;br /&gt;
!Contact&lt;br /&gt;
!Conservation status&lt;br /&gt;
!How to mirror&lt;br /&gt;
!How to keep up&lt;br /&gt;
!Notes&lt;br /&gt;
|-&lt;br /&gt;
|'''(sample entry)''' GitHubBub forge&lt;br /&gt;
|Git hosting&lt;br /&gt;
|John Doe &amp;lt;john@example.com&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|retrieve full repo list at /api/list, then git clone on each entry&lt;br /&gt;
|poll RSS feed at /api/updates?since=YYYY-MM-DD&lt;br /&gt;
|nothing special to add&lt;br /&gt;
|-&lt;br /&gt;
|[http://pauillac.inria.fr/~huet/cea.html Gérard Huet's seminal work on 3D]&lt;br /&gt;
|Scanned source code&lt;br /&gt;
|Gérard Huet &amp;lt;gerard.huet@inria.fr&amp;gt;&lt;br /&gt;
|EN&lt;br /&gt;
|retrieve listing images from the web pages&lt;br /&gt;
|N/A&lt;br /&gt;
|links are half broken, yquem should be replaced with pauillac everywhere it appears&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.gentoo.org/ Gentoo]&lt;br /&gt;
|&lt;br /&gt;
|Johannes Kellner &amp;lt;gentoo@johannes-kellner.eu&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://www.softwarepreservation.org/projects Software Preservation Project]&lt;br /&gt;
|Website with a collection of archives&lt;br /&gt;
|Paul McJones &amp;lt;paul@mcjones.org&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://code.nasa.gov/ 253 NASA open source software projects]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://smaky.ch/ Smaky], the swiss micro-computer series&lt;br /&gt;
| [http://infini.smaky.ch/sources.html Partial code dump]&lt;br /&gt;
| [mailto:arnaud@epsitec.ch Pierre Arnaud] (current CEO of Epsitec) and/or [mailto:jean-daniel.nicoud@epfl.ch Jean-Daniel Nicoud] (founder of the computer series)&lt;br /&gt;
| EN&lt;br /&gt;
| Probably manually&lt;br /&gt;
| No new updates&lt;br /&gt;
| Some references to this history: [http://www.memoires-informatiques.org/ Fondation Mémoires Informatiques], [http://smaky.ch/ Smaky.ch] (in particular, [http://smaky.ch/theme.php?id=lami the short history])&lt;br /&gt;
|-&lt;br /&gt;
|[https://wiki.debian.org/Derivatives/Census all Debian derivatives]&lt;br /&gt;
|Debian-based distros&lt;br /&gt;
|Paul Wise &amp;lt;pabs@debian.org&amp;gt;&lt;br /&gt;
|varying, depending on the distro&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://sourceforge.net/ SourceForge]&lt;br /&gt;
|CVS, SVN, Mercurial, Git&lt;br /&gt;
|&lt;br /&gt;
|VU&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/FLOSS#Conservation_status_2 wikidata endangered software]&lt;br /&gt;
|depends on the '''source code repository''' property&lt;br /&gt;
|Loic Dachary &amp;lt;loic@dachary.org&amp;gt;&lt;br /&gt;
|The risk is higher than [https://www.wikidata.org/wiki/Property_talk:P141 LC]&lt;br /&gt;
|A script should obtain the '''source code repository''' property for the software and mirror it depending on the [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/FLOSS#source_code_repository protocol] qualifier. If the '''source code repository''' is '''no value''', the [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/Software#streaming_media_URL streaming media URL] of the '''preferred''' [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/Software#software_version_.28P348.29 software version] should be downloaded instead.&lt;br /&gt;
|Once a copy is secured by Software Heritage, a URL to the Software Heritage repository should be added to the '''source code repository''' property and the '''conservation status''' property should be removed, meaning it is '''least concern''' by default. The software will then no longer show in the list of endangered software.&lt;br /&gt;
|This is work in progress, part of the [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/FLOSS wikidata FLOSS project] and the scripts do not exist yet.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Archive]]&lt;br /&gt;
[[Category:Suggestions]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=417</id>
		<title>Suggestion box: source code to add</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=417"/>
		<updated>2016-08-09T09:38:41Z</updated>

		<summary type="html">&lt;p&gt;Dachary: this would be very tedious to maintain on a per software basis&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Archive]] is growing organically. We started &amp;quot;small&amp;quot;, tracking 3 '''software origins''' (GitHub + Debian + GNU), and we will be adding new origins bit by bit, depending on the urgency of archiving them and available development energies to integrate them into Software Heritage.&lt;br /&gt;
&lt;br /&gt;
Using this page you can suggest software origins that we aren't following yet, but should be. You can include information about who to contact for technical collaboration, the urgency of archival, and other useful information. To that end, just add a row to the table below. Here is some information about the meaning of the various columns.&lt;br /&gt;
&lt;br /&gt;
Entries are currently listed simply in order of addition to this page; we will add more structure when the list starts growing.&lt;br /&gt;
&lt;br /&gt;
=== Legend ===&lt;br /&gt;
&lt;br /&gt;
;Software origin&lt;br /&gt;
: any (publicly accessible) &amp;quot;place&amp;quot; on the Internet that hosts software in source code form. Please provide a title for it and hyperlink it to the relevant URL&lt;br /&gt;
;Type of origin&lt;br /&gt;
: information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, a homepage publishing tarballs, or a one-shot source code repository. For all kinds of repositories please specify which VCS system is in use (Git, SVN, CVS, etc.)&lt;br /&gt;
;Contact&lt;br /&gt;
: who to contact for technical collaboration on how to best archive source code hosted on the software origin. You can list yourself if you're the relevant person, or provide the most relevant contact point if you know it&lt;br /&gt;
;Conservation status&lt;br /&gt;
: information about how likely it is that the software origin will disappear; high likelihood will make it more urgent for us to archive software hosted there. We suggest to use the [https://en.wikipedia.org/wiki/Conservation_status species conservation status], i.e., one of: Critically endangered (CR), Endangered (EN), Vulnerable (VU), Near threatened (NT), Least concern (LC).&lt;br /&gt;
;How to mirror&lt;br /&gt;
: (pointers to) technical information on how to do a full mirror of ''all'' the source code available at the software origin, ideally one shot and in batch&lt;br /&gt;
;How to keep up&lt;br /&gt;
: (pointers to) technical information on how to incrementally retrieve new source code accumulated since the last visit; usually this should be based on some kind of incremental change feed or event API&lt;br /&gt;
;Notes&lt;br /&gt;
: anything else you think we should know about this software origin&lt;br /&gt;
&lt;br /&gt;
== Suggestions ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Software origin&lt;br /&gt;
!Type of origin&lt;br /&gt;
!Contact&lt;br /&gt;
!Conservation status&lt;br /&gt;
!How to mirror&lt;br /&gt;
!How to keep up&lt;br /&gt;
!Notes&lt;br /&gt;
|-&lt;br /&gt;
|'''(sample entry)''' GitHubBub forge&lt;br /&gt;
|Git hosting&lt;br /&gt;
|John Doe &amp;lt;john@example.com&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|retrieve full repo list at /api/list, then git clone on each entry&lt;br /&gt;
|poll RSS feed at /api/updates?since=YYYY-MM-DD&lt;br /&gt;
|nothing special to add&lt;br /&gt;
|-&lt;br /&gt;
|[http://pauillac.inria.fr/~huet/cea.html Gérard Huet's seminal work on 3D]&lt;br /&gt;
|Scanned source code&lt;br /&gt;
|Gérard Huet &amp;lt;gerard.huet@inria.fr&amp;gt;&lt;br /&gt;
|EN&lt;br /&gt;
|retrieve listing images from the web pages&lt;br /&gt;
|N/A&lt;br /&gt;
|links are half broken, yquem should be replaced with pauillac everywhere it appears&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.gentoo.org/ Gentoo]&lt;br /&gt;
|&lt;br /&gt;
|Johannes Kellner &amp;lt;gentoo@johannes-kellner.eu&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://www.softwarepreservation.org/projects Software Preservation Project]&lt;br /&gt;
|Website with a collection of archives&lt;br /&gt;
|Paul McJones &amp;lt;paul@mcjones.org&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://code.nasa.gov/ 253 NASA open source software projects]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://smaky.ch/ Smaky], the swiss micro-computer series&lt;br /&gt;
| [http://infini.smaky.ch/sources.html Partial code dump]&lt;br /&gt;
| [mailto:arnaud@epsitec.ch Pierre Arnaud] (current CEO of Epsitec) and/or [mailto:jean-daniel.nicoud@epfl.ch Jean-Daniel Nicoud] (founder of the computer series)&lt;br /&gt;
| EN&lt;br /&gt;
| Probably manually&lt;br /&gt;
| No new updates&lt;br /&gt;
| Some references to this history: [http://www.memoires-informatiques.org/ Fondation Mémoires Informatiques], [http://smaky.ch/ Smaky.ch] (in particular, [http://smaky.ch/theme.php?id=lami the short history])&lt;br /&gt;
|-&lt;br /&gt;
|[https://wiki.debian.org/Derivatives/Census all Debian derivatives]&lt;br /&gt;
|Debian-based distros&lt;br /&gt;
|Paul Wise &amp;lt;pabs@debian.org&amp;gt;&lt;br /&gt;
|varying, depending on the distro&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://sourceforge.net/ SourceForge]&lt;br /&gt;
|CVS, SVN, Mercurial, Git&lt;br /&gt;
|&lt;br /&gt;
|VU&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Archive]]&lt;br /&gt;
[[Category:Suggestions]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=416</id>
		<title>Suggestion box: source code to add</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=416"/>
		<updated>2016-08-09T08:49:07Z</updated>

		<summary type="html">&lt;p&gt;Dachary: Add KSEG&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Archive]] is growing organically. We started &amp;quot;small&amp;quot;, tracking 3 '''software origins''' (GitHub + Debian + GNU), and we will be adding new origins bit by bit, depending on the urgency of archiving them and available development energies to integrate them into Software Heritage.&lt;br /&gt;
&lt;br /&gt;
Using this page you can suggest software origins that we aren't following yet, but should be. You can include information about who to contact for technical collaboration, the urgency of archival, and other useful information. To that end, just add a row to the table below. Here is some information about the meaning of the various columns.&lt;br /&gt;
&lt;br /&gt;
Entries are currently listed simply in order of addition to this page; we will add more structure when the list starts growing.&lt;br /&gt;
&lt;br /&gt;
=== Legend ===&lt;br /&gt;
&lt;br /&gt;
;Software origin&lt;br /&gt;
: any (publicly accessible) &amp;quot;place&amp;quot; on the Internet that hosts software in source code form. Please provide a title for it and hyperlink it to the relevant URL&lt;br /&gt;
;Type of origin&lt;br /&gt;
: information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, a homepage publishing tarballs, or a one-shot source code repository. For all kinds of repositories please specify which VCS system is in use (Git, SVN, CVS, etc.)&lt;br /&gt;
;Contact&lt;br /&gt;
: who to contact for technical collaboration on how to best archive source code hosted on the software origin. You can list yourself if you're the relevant person, or provide the most relevant contact point if you know it&lt;br /&gt;
;Conservation status&lt;br /&gt;
: information about how likely it is that the software origin will disappear; high likelihood will make it more urgent for us to archive software hosted there. We suggest to use the [https://en.wikipedia.org/wiki/Conservation_status species conservation status], i.e., one of: Critically endangered (CR), Endangered (EN), Vulnerable (VU), Near threatened (NT), Least concern (LC).&lt;br /&gt;
;How to mirror&lt;br /&gt;
: (pointers to) technical information on how to do a full mirror of ''all'' the source code available at the software origin, ideally one shot and in batch&lt;br /&gt;
;How to keep up&lt;br /&gt;
: (pointers to) technical information on how to incrementally retrieve new source code accumulated since the last visit; usually this should be based on some kind of incremental change feed or event API&lt;br /&gt;
;Notes&lt;br /&gt;
: anything else you think we should know about this software origin&lt;br /&gt;
&lt;br /&gt;
== Suggestions ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Software origin&lt;br /&gt;
!Type of origin&lt;br /&gt;
!Contact&lt;br /&gt;
!Conservation status&lt;br /&gt;
!How to mirror&lt;br /&gt;
!How to keep up&lt;br /&gt;
!Notes&lt;br /&gt;
|-&lt;br /&gt;
|'''(sample entry)''' GitHubBub forge&lt;br /&gt;
|Git hosting&lt;br /&gt;
|John Doe &amp;lt;john@example.com&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|retrieve full repo list at /api/list, then git clone on each entry&lt;br /&gt;
|poll RSS feed at /api/updates?since=YYYY-MM-DD&lt;br /&gt;
|nothing special to add&lt;br /&gt;
|-&lt;br /&gt;
|[http://pauillac.inria.fr/~huet/cea.html Gérard Huet's seminal work on 3D]&lt;br /&gt;
|Scanned source code&lt;br /&gt;
|Gérard Huet &amp;lt;gerard.huet@inria.fr&amp;gt;&lt;br /&gt;
|EN&lt;br /&gt;
|retrieve listing images from the web pages&lt;br /&gt;
|N/A&lt;br /&gt;
|links are half broken, yquem should be replaced with pauillac everywhere it appears&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.gentoo.org/ Gentoo]&lt;br /&gt;
|&lt;br /&gt;
|Johannes Kellner &amp;lt;gentoo@johannes-kellner.eu&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://www.softwarepreservation.org/projects Software Preservation Project]&lt;br /&gt;
|Website with a collection of archives&lt;br /&gt;
|Paul McJones &amp;lt;paul@mcjones.org&amp;gt;&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://code.nasa.gov/ 253 NASA open source software projects]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://smaky.ch/ Smaky], the swiss micro-computer series&lt;br /&gt;
| [http://infini.smaky.ch/sources.html Partial code dump]&lt;br /&gt;
| [mailto:arnaud@epsitec.ch Pierre Arnaud] (current CEO of Epsitec) and/or [mailto:jean-daniel.nicoud@epfl.ch Jean-Daniel Nicoud] (founder of the computer series)&lt;br /&gt;
| EN&lt;br /&gt;
| Probably manually&lt;br /&gt;
| No new updates&lt;br /&gt;
| Some references to this history: [http://www.memoires-informatiques.org/ Fondation Mémoires Informatiques], [http://smaky.ch/ Smaky.ch] (in particular, [http://smaky.ch/theme.php?id=lami the short history])&lt;br /&gt;
|-&lt;br /&gt;
|[https://wiki.debian.org/Derivatives/Census all Debian derivatives]&lt;br /&gt;
|Debian-based distros&lt;br /&gt;
|Paul Wise &amp;lt;pabs@debian.org&amp;gt;&lt;br /&gt;
|varying, depending on the distro&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://sourceforge.net/ SourceForge]&lt;br /&gt;
|CVS, SVN, Mercurial, Git&lt;br /&gt;
|&lt;br /&gt;
|VU&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.wikidata.org/wiki/Q846869 KSEG]&lt;br /&gt;
|http://www.mit.edu/~ibaran/kseg-0.403.tar.gz&lt;br /&gt;
|&lt;br /&gt;
|VU&lt;br /&gt;
|HTTP GET&lt;br /&gt;
|N/A&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Archive]]&lt;br /&gt;
[[Category:Suggestions]]&lt;/div&gt;</summary>
		<author><name>Dachary</name></author>
	</entry>
</feed>