Reading Summary 2019-05-20

· by Aldrin Montana

Cudele: An API and Framework for Programmable Consistency and Durability in a Global Namespace

Problem Statement

The authors address the cost of using the POSIX API in HPC systems: the POSIX API inherently requires synchronization and serialization for metadata accesses. The paper approaches this problem indirectly. It first argues that the consistency guarantees of the POSIX API are not always necessary, then shows how they can be tuned programmatically or dynamically for a given workload.

The related work consists of file systems used in HPC systems: BatchFS, DeltaFS, IndexFS, MarFS, and TwoTier. Each of these attempts to decouple the namespace from the data in the storage system in some way. While they address some performance concerns, they come with trade-offs: (1) merging metadata state into the global namespace is slow and can require specific consistency management; (2) failing clients can lead to loss of metadata operations spanning decoupled namespaces; (3) backwards compatibility may be lost, requiring applications to link against new storage system interfaces, which may be difficult for legacy applications and is undesirable overhead for any application.

Proposed Solution

The authors’ approach is to reuse mechanisms in Ceph (RPCs, stream, local persist, append client journal, volatile apply, nonvolatile apply, global persist) and to provide programmable policies that allow clients to tune consistency guarantees to the needs of application workloads. These policies can also differ between partitions, or subtrees, in the namespace.
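To make the idea of per-subtree policies concrete, here is a minimal sketch in Python. It is not Cudele's actual API: the `Policy` and `Namespace` classes, the policy names "strong" and "relaxed", the example paths, and the particular mechanism combinations are all hypothetical and chosen only for illustration. The sketch assumes a policy is an ordered composition of the mechanisms listed above and that a path inherits the policy of its nearest ancestor subtree.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Mechanism names as listed in the summary above (spelling is ours).
MECHANISMS = {
    "rpcs", "stream", "local_persist", "append_client_journal",
    "volatile_apply", "nonvolatile_apply", "global_persist",
}


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: a named, ordered composition of mechanisms."""
    name: str
    mechanisms: tuple

    def __post_init__(self):
        # Reject anything that is not one of the known mechanisms.
        unknown = set(self.mechanisms) - MECHANISMS
        if unknown:
            raise ValueError(f"unknown mechanisms: {sorted(unknown)}")


class Namespace:
    """Hypothetical global namespace mapping subtrees to policies."""

    def __init__(self, default):
        self._policies = {PurePosixPath("/"): default}

    def set_policy(self, subtree, policy):
        self._policies[PurePosixPath(subtree)] = policy

    def policy_for(self, path):
        p = PurePosixPath(path)
        # Walk from the path up to the root; the nearest ancestor wins.
        for candidate in (p, *p.parents):
            if candidate in self._policies:
                return self._policies[candidate]
        raise KeyError(path)


# Illustrative compositions only; not the policies defined in the paper.
strong = Policy("strong", ("rpcs", "stream"))
relaxed = Policy("relaxed", ("append_client_journal", "local_persist"))

ns = Namespace(default=strong)
ns.set_policy("/scratch/job42", relaxed)

print(ns.policy_for("/scratch/job42/out/ckpt.0").name)
print(ns.policy_for("/home/user/notes.txt").name)
```

In this sketch, files under `/scratch/job42` resolve to the relaxed policy while the rest of the namespace keeps the default, which mirrors the idea that different subtrees can carry different consistency and durability guarantees.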

Comments and Questions

  1. I had looked into varying the consistency levels of Ceph before, mainly with the goal of letting developers choose the level that fits their needs. I think it would be cool to see how well such programming models could mix with programmable storage.

  2. What is the difference between separating namespaces and creating multiple mount points that talk to the same backend storage system? For a non-networked file system I could see a large difference, but I’m not sure what the difference is when multiple mount points can interface with the same storage.