Reading Summary 2019-05-20

· by Aldrin Montana

Mantle: A Programmable Metadata Load Balancer for the Ceph File System

Problem Statement

The authors are trying to address: when, where, and what quantity of resources to move. The preferred approach to doing this is repurpose Ceph’s metadata balancer, but this has a secondary problem of being difficult to extend to a purpose it was not designed for. This secondary problem is addressed by the authors as, how to move resources.

Mentioned related work focus on mechanisms for moving metadata (how to move) and are grouped into 3 categories: mapping resources to servers via hash functions, table or index-based mappings, assignment of hierarchies or subtree partitions of resources to servers.

Hashing mechanisms distribute metadata evenly across servers and often provide dynamic rebalancing of keys. Hashes are typically computed on filenames or pathnames. However, mapping of resources to various servers are typically still centralized and can remain a bottleneck. Projects that explore, or utilize, these approaches are: PVFSv2, SkyFS, CalvinFS, and GPFS. Projects that attempt to allow distributed management of resources are: GIGA+ and IndexFS. These projects allow new files or directories to talk directly to servers that manage parent resources. However, all of these approaches are unable to leverage both temporal and spatial locality of metadata and workloads. Table-based mechanisms have these same problems.

Subtree partitioning mechanisms are able to leverage locality, distributed management of resource mappings to servers, and even load distribution. However, these mechanisms require careful planning and are often statically configured at setup. Projects that use this mechanism, Ursa Minor and Farsite, are limited in their ability to adapt to spikes in workloads and maintain balance of server load over time. Panasas allows addition of new subtrees during runtime, but this still does not address adaptation to workloads.

Proposed Solution

The authors’ solution is to provide an API that allows for varied and expressive policies over the metadata load balancer instead of the tightly coupled policies that work well for some workloads but not for the types of workloads studied in this paper. The API provided focuses on 4 clauses or components of a policy: load calculation, guard for when to migrate or rebalance metadata, load is represented across a list of targets to be loaded and the values can be adjusted as desired, finally a “selector” is provided that specifies how to choose resources to be moved until the target load is met. For how resources are chosen, some examples are big_first, meaning largest directory fragment first, and big_small, meaning alternation between largest and smallest directory fragments.

Mantle, the implementation of the proposed solution, performs well in some cases and poorly in others. When spilling metadata management to a 2nd MDS, improvements in throughput can be seen. When spilling to 3 or 4 MDSs, performance degradations can be seen for greedy spill (eager and even spilling of resources). Typically, spill and fill always improved performance, and it is shown that “too aggressive” spilling is better than “conservative” but it appears that “aggressive” is best overall. In any case, improvements were between 10% and 25%, which I think is a good amount of improvement.

Contribution

This work shows another aspect of how programmable storage can be a great way to bring new interfaces to existing storage systems, but with minimal modifications. This paper also shows that while separation of policy and mechanism is known to be a good thing, there are ways of further exploring this space to improve performance but also to explore the extent to which existing systems can be extended to have programmable policies that may open up many research opportunities.

Comments and Questions

I think that programmable storage is making a clear foundation on which auto-tuning, or dynamic data-driven decisions for policies, can be the next step in improving performance and simplifying management of large infrastructures such as storage systems.
I would like to see programmable policies for collecting performance metrics. My understanding is that collecing performance numbers is typically a matter of turning on metrics and then doing some analysis on the results, but I feel that there may be some benefit to policies that allow for sampling or collection of variable regions or functions or perhaps even streaming calculations to reduce the amount of data that needs to be stored and maintained.
I don’t think the paper discussed this, but I wonder if it’s possible to use metadata balancing as a mechanism for balancing analysis workloads as well. In the sense that zlog uses metadata, such as inodes, as a mechanism for implementing corfu in ceph, can balancing of inodes or partitioning of inodes be a way of managing multiple counters? This may not be useful, but it seems reasonable to ask if programmable policies lend themselves towards composability or if mechanisms can and should be composed and a rich programmable policy be provided over the composed mechanisms.