Reading Summary 2019-04-30
· by Aldrin Montana
CORFU - A Shared Log Design for Flash Clusters
The authors are trying to design a distributed, shared log that can be used by any distributed application that requires strong consistency.
The authors mention that they heavily borrow architecture from FAWN and design of CORFU was primarily to address issues with Hyder. In these ways, it seems like CORFU is primarily an implementation of Hyder with a de-coupling of obtaining log positions and writing data to the log, using a very lightweight “hole-filling” approach, and the use of projections to map log positions to flash devices (likely inspired by virtual memory). Then, all of this design was done on top of a FAWN-like architecture in order to provide a log service to applications in a data center.
The authors propose CORFU which organizes a cluster of flash devices into a shared log. CORFU is designed for network-attached flash devices and so it leverages the performance characteristics of flash, it tries to distribute wear-levelling, and it is able to be more cost efficient by providing log functions without storage servers (because it works over network-attached devices).
Correctness is ensured by a 2-step reconfiguration protocol: (1) seal current projection, (2) write new projection. Projections provide a coarse granularity for maintaining consistency between clients, and also for isolating log position ranges that need to be recovered from ones that are still service-able.
The main performance bottleneck is in finding the tail position of the log, which is improved with the use of a dedicated sequencer, which does not solve contention problems but optimizes for the common case.
Performance seems relatively good, though there is no baseline against which to understand the performance. There also does not seem to be an evaluation that significantly exercises write contention, though reconfiguration appears to be plenty efficient to me. Reconfiguration can happen frequently and so it needs to be as efficient as possible, and with as many as 32 flash devices sealing latency is no more than 6 ms (most of the time) and total latency is less than 40. This also appears to translate to ~3 second delay in throughput, but overall throughput does not need to ramp back up when a reconfiguration occurs. That being said, the provided evaluation for state machine replication may look good, but I am not convinced that this is something that databases would be interested in using.
It appears to be very early work of implementing a distributed, strongly consistent primitive (log) on top of clustered flash devices.
Comments and Questions
The use of reconfiguration to handle adding a new flash device to replace a failed flash device seemed very interesting and smart. I feel that the overall design of projections is similar to virtual memory, and so I wonder if a similar mechanism was first proposed for shared memory systems that use virtual memory to span multiple memory units.
I don’t think databases would be too excited to use CORFU as a log primitive, though I feel that the dedicated network sequencer would be a useful component to re-use. I have read Noah’s work on zlog, but I don’t think I was able to understand it as well. Now I want to re-read it to see if part of the contribution of zlog and programmable storage is to enable access to a network sequencer and finding the tail position of a log, rather than how data is written to the log.