Ceph: A Scalable, High-Performance Distributed File System

What is the problem the authors are trying to solve?

The authors are trying to decouple data and metadata management, and to distribute and scale operations that previous systems consolidated in a single server (e.g. the GFS master server).

What other approaches or solutions existed at the time that this work was done? What was wrong with the other approaches or solutions?

The related work discussed in the paper includes OceanStore, Farsite, Vesta, Galley, PVFS, Swift, GPFS, StorageTank, LegionFS, GFS, Sorrento, Federated Array of Bricks (FAB), pNFS, Lustre, the Panasas file system, zFS, and Kybos. These projects can be broadly grouped by whether they adopt object-based storage and by how far they distribute metadata management.

Ceph is object-based, and thus similar to the object-based storage systems, but it also decouples metadata and data management in the spirit of GPFS and StorageTank, only to a much greater degree. By combining these two ideas, Ceph achieves scalability for general workloads and flexibility in cluster topology that the aforementioned related works do not.

What is the authors’ approach or solution? Why is it better than the other approaches or solutions? How does it perform?

Ceph’s overall design is centered around a few key components: (1) CRUSH, a deterministic placement function that scalably maps placement groups onto object storage devices (objects are first hashed into placement groups), so any client can compute where data lives; (2) layers of indirection (placement groups and object storage devices) that decouple metadata management from data management; (3) object storage devices that take on responsibilities beyond data persistence (replication, migration, failure detection and recovery) and perform them efficiently.

The CRUSH algorithm handles dynamic cluster topologies, allowing placement to scale without a central allocation table that must be kept consistent and consulted on every access. This is an improvement over work like GFS, where the master server was responsible for namespace and placement management. Although it’s possible to vertically scale the master server, there is necessarily a limit to how performant and reliable a single server can be, especially for a general use case.
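
To make this concrete, below is a minimal Python sketch of the two-step mapping: objects are hashed into placement groups, and placement groups are mapped onto OSDs by a deterministic function (CRUSH in the real system; a simple stand-in hash here), so any client can compute data locations without consulting a central table. All names and parameters are illustrative, not Ceph’s actual API.

```python
import hashlib

NUM_PGS = 128          # placement groups in the pool (illustrative)
NUM_OSDS = 12          # object storage devices in the cluster (illustrative)
REPLICATION = 3        # replicas per placement group

def stable_hash(key: str) -> int:
    """Deterministic hash so every client computes the same mapping."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

def object_to_pg(object_name: str) -> int:
    """Step 1: hash the object name into a placement group (PG)."""
    return stable_hash(object_name) % NUM_PGS

def pg_to_osds(pg_id: int) -> list[int]:
    """Step 2: map a PG to an ordered list of OSDs.

    Real Ceph uses CRUSH here, which walks a weighted hierarchy in the
    cluster map; this rendezvous-style hash is only a stand-in to show
    that the mapping is computed rather than looked up in a table.
    """
    ranked = sorted(range(NUM_OSDS),
                    key=lambda osd: stable_hash(f"{pg_id}:{osd}"),
                    reverse=True)
    return ranked[:REPLICATION]

# Any client can compute where an object lives without asking a server.
pg = object_to_pg("10000000000.00000000")   # hypothetical object name
print(pg, pg_to_osds(pg))
```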

The separation of file system metadata management into a metadata server (MDS) cluster (with a small monitor cluster maintaining the cluster map) and data management into object storage devices keeps metadata available and efficient, while OSDs can flexibly exchange data objects with each other and with clients, enabling high performance in the spirit of NASD and GFS, but more reliably than other related work.
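
Building on the placement sketch above, here is a rough sketch of the read path this decoupling enables (the types and functions such as `FileMetadata` and `osd_read` are hypothetical, not Ceph’s real interfaces): the client consults the MDS once to obtain an inode and striping rule, derives object names itself, and then exchanges data directly with OSDs.

```python
from dataclasses import dataclass

@dataclass
class FileMetadata:
    """What the MDS hands back: no per-block lists, just an inode and a
    striping rule from which every object name can be derived."""
    inode: int
    object_size: int   # bytes per object

def object_name(inode: int, index: int) -> str:
    # Object names follow from the inode and stripe index, so the MDS
    # never has to track individual object locations.
    return f"{inode:x}.{index:08x}"

def osd_read(osd_id: int, name: str) -> bytes:
    """Stub for a direct client-to-OSD fetch (a network RPC in practice)."""
    return b""   # placeholder; a real client would receive the object's bytes

def read(meta: FileMetadata, offset: int, length: int) -> bytes:
    """Hypothetical client-side read: one metadata lookup, then all data
    traffic goes straight to the OSDs chosen by the placement function."""
    data = b""
    index = offset // meta.object_size
    while index * meta.object_size < offset + length:
        name = object_name(meta.inode, index)
        pg = object_to_pg(name)          # from the placement sketch above
        primary = pg_to_osds(pg)[0]      # read from the PG's primary OSD
        data += osd_read(primary, name)
        index += 1
    return data
```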

Why is this work important?

This work is important because it is one of the first (maybe the first?) distributed storage systems to provide scalability for general use cases, while also using a pseudo-random placement function (CRUSH, similar in spirit to consistent hashing) and dynamic partitioning of the namespace to make metadata management scalable, a key capability many other systems lacked.

3+ comments/questions

  1. I am still interested in seeing whether it’s possible to adapt the consistency model used in Ceph to match particular workloads based on hints provided in application logic (http://research.aldrinmontana.com/blog/mixing-consistency/coalesced/diversifying-consistency-in-ceph)

  2. I have started working with microservice-based distributed systems that use sidecar proxies for communication, and it could be interesting to see whether the overhead of such proxies in Ceph would be an acceptable trade-off for the ability to test other consistency protocols, i.e. to see whether other replication approaches can further improve efficiency.

  3. It would be interesting to try pipelining the TCP transfer through the OSD replicas within a placement group, which I don’t think Ceph does. Each replica could ack at the end of the transfer that the data arrived correctly, and once at least 2 replicas have ack’d, bringing the remaining (possibly stale) replicas up to date could be resolved among the replicas themselves (a rough sketch of this idea follows after this list).

  4. On the topic of the HDF5 project, it could be interesting to more closely integrate access library semantics into the metadata servers so that group and dataset topologies (dataset shapes) can be managed at the namespace or metadata level, which maintains the decoupling between data and metadata for file formats such as HDF5.
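
Regarding comment 3, below is a toy Python sketch of the pipelined-ack idea (entirely hypothetical; this is not Ceph’s actual replication protocol): the write flows through the replicas of a placement group, the client is acknowledged once at least two replicas confirm with matching checksums, and any remaining replicas reconcile afterwards.

```python
import hashlib

ACK_QUORUM = 2   # client is acknowledged once this many replicas confirm

class Replica:
    """Toy in-memory stand-in for an OSD replica in a placement group."""
    def __init__(self, osd_id: int):
        self.osd_id = osd_id
        self.objects = {}

    def write(self, name: str, data: bytes) -> str:
        self.objects[name] = data
        return hashlib.sha1(data).hexdigest()   # ack carries a checksum

def pipelined_write(replicas: list[Replica], name: str, data: bytes) -> bool:
    """Hypothetical pipelined write through a placement group's replicas.

    In a real pipeline each replica would forward the stream to the next
    hop while writing locally; the loop below stands in for that chain.
    The client is acknowledged as soon as ACK_QUORUM matching checksums
    come back, and the remaining replicas reconcile afterwards.
    """
    expected = hashlib.sha1(data).hexdigest()
    acks = 0
    pending = list(replicas)
    while pending and acks < ACK_QUORUM:
        replica = pending.pop(0)
        if replica.write(name, data) == expected:
            acks += 1
    # Replicas that have not been written yet catch up from the acked copies.
    for replica in pending:
        replica.write(name, data)
    return acks >= ACK_QUORUM

# Example: three replicas in a placement group, ack the client after two.
pg_replicas = [Replica(i) for i in (3, 7, 11)]
print(pipelined_write(pg_replicas, "10000000000.00000000", b"hello"))
```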