Reading Summary 2019.05.29

· by Aldrin Montana

Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently

Overall Evaluation

Maelstrom is a framework for defining tasks and “runbooks” (set of tasks) for partially automating datacenter-level actions such as choosing which datacenter to route requests to from edge (PoP) servers and how to distribute load across clusters (set of nodes in a datacenter). By controlling load balancing from edge servers to data centers (via edge weights) and load balancing between clusters within a data center (via cluster weights), Maelstrom is able to help operators perform high-level tasks (e.g. data center transitions, server transitions, test network partition recovery) necessary for testing and disaster recovery.

I would accept this because I believe there are interesting directions for: (1) interaction between this type of tool and service tests, and (2) applying ideas of programmable policies (see programmable storage) and new, composable mechanisms to a maelstrom-like framework.

Strong Points

The effective use of two mechanisms–edge load balancers and cluster load balancers–to support high-level tasks that not only make work easier for operators but make it possible for operators to more easily partially automate new tasks, test data center-level infrastructure, and even reuse each other’s work.
I think from a data center perspective and an infrastructure perspective, the evaluation seems very good and the generated graphs seem capable of helping operators do tests and monitor health effectively. I could even imagine a scenario where this data is made available for service regression or integration tests to query so that changes or new features can ensure that there is no unexpected behavior in specific fault-tolerance scenarios or consensus protocols.

Weak Points

I think the organization could be much better to appropriately convey the information. Specifically, the background section is actually a description of the mechanisms used, while “Maelstrom Overview” is a description of policy components. Then, the “Design and Implementation” section describes how to create policies for operators to use and how these policies use the available mechanisms. This would make 4.1 almost redundant, and all subsections in section 4 fit into a clean mental model much quicker for the reader.
The connection between changes made using Maelstrom to application or service specific errors or results should be much stronger. There is mention of how Maelstrom tasks affect web traffic (Figure 6), but there is no explanation of how operators would be able to understand these affects and adjust for them. Are their only options to revert changes, or keep changes? Change order of task operations and hope that it fixes the problem? These seem like unprincipled approaches, yet no mention is made of how Maelstrom could do minimal, or sampled, packet inspection to make better decisions.

Questions Raised

Is it possible for maelstrom to access service-level information in the form of generated configurations or an inspection-type API in order to provide better feedback to operators about the effects of maelstrom tasks on services other than graphs of low-layer network behavior. This question is related to the 2nd weak point mentioned above.