- TODO: expand this section to be less handwavey.
Ceph is a massively scale-out distributed object storage system. It provides transparent replication of objects across disks and storage nodes, load balancing for reads, and a set of features that would be very useful in the context of Software Heritage (for instance, the possibility to federate several clusters distributed on several data centers).
Ceph provides a low-level object storage layer through the RADOS library, as well as more high level APIs such as a Swift/S3 compatibility layer, a block device storage layer, as well as a filesystem layer. Considering the Software Heritage abstraction level, the RADOS object storage layer should be sufficient for our needs.
We're investigating using ceph as an object store for the Software Heritage data.
Dimensioning a Ceph cluster
Ceph clusters use two components: monitors and OSDs.
Machine types and jargon
A node is a physical machine, that each can run one or more of the ceph storage daemons
A rack is a set of nodes interconnected with an access and a replication network
A datacenter is a set of racks interconnected with their access and replication networks.
OSDs (Object Storage Daemons) are the machines whose role is to store the actual data. You would typically set up a machine with a bunch of disks, create one partition on each disk and run one OSD on each of those partitions. You do not need to set up OS-level redundancy as the Ceph redundancy settings will handle everything, from disk failure up to datacenter failure, in a transparent way, rebalancing and re-replicating data as needed to keep the replication constraints satisfied.
Monitors keep the state of the cluster and orchestrate the OSDs. There can (and should !) be several monitors per cluster to be able to tolerate failures at all levels of node distribution. However, monitors aren't central machines to the cluster as reads and writes are directly dispatched from the clients to the OSDs.
For this cluster dimensioning, we aim for 800TB usable storage, with every object is replicated 3 times: capacity to withstand the catastrophic loss of two nodes simultaneously. The best practice is to keep the cluster at around 70% utilization, and to scale it out by adding nodes regularly when usage increases.
Note that this approach also works while bootstrapping the cluster: you only really need three nodes (one monitor and two OSDs) to bootstrap the cluster, and then you can add machines as you go.
OSDs need about 100MB RAM / TB of storage in typical cases, but extreme situations can require up to 2GB. We therefore should consider 2GB RAM / 1TB storage. CPU on those machines is not relevant (well, they need one).
Here are some reasonable options considering Dell offerings, and the disks we already have available:
|Base server||Chassis settings||RAM||Number of disks bought||Max storage capacity (with recycling)||Price||Number needed for 800TB capacity|
|PowerEdge R730xd||12 x 3.5" disks||192 GB (max 96TB disks)||6 x 8TB 3.5" (= 48 TB)||6 x 8TB + 6 x 6TB (= 84 TB)||6994 EUR HT||25|
|PowerEdge R730xd||24 x 2.5" disks||128 GB (max 64TB disks)||12 x 2TB 2.5" (= 24 TB)||add 12 x 2TB (= 48 TB)||5500 EUR HT||50|
Total: ~450 KEUR
Assuming linear scaling in both redundancy and size (TO BE VERIFIED, including disk reuse considerations):
- 800 TB with 2 copies: ~300 KEUR
- 300 TB with 3 copies: ~168 KEUR
- 300 TB with 2 copies: ~110 KEUR