Reading Summary 2019.05.24
by Aldrin Montana
Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
Amazon Aurora is a relational database offered by AWS that pairs a relational front-end and kernel (MySQL or PostgreSQL) with a highly available, causally consistent (my interpretation) storage system (a purpose-built, multi-tenant storage service that backs up to S3, rather than S3 itself). In order to avoid network amplification and improve throughput, Aurora does not use consensus for transaction commits or acknowledgements. Each write is represented as a redo log record, as in a traditional database, and is sent asynchronously to storage nodes, each of which acknowledges asynchronously once the record is durable. Separately, storage nodes sort their redo log records and gossip with their peers to fill in any records they are missing. The sorted redo log records form a causal history, and the largest log sequence number (LSN) up to which a segment holds every record with no holes marks the point through which that segment is complete (I am not sure what differentiates segments; probably content?).
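To make the "no holes" idea concrete, here is a minimal sketch of how a single segment could work out its segment-complete LSN. It assumes each redo record carries a back-link to the previous LSN destined for the same segment; the function name `segment_complete_lsn`, the dict layout, and the LSN values are hypothetical and only for illustration, not Aurora's actual code.

```python
def segment_complete_lsn(records, start_lsn=0):
    """Return the highest LSN reachable from start_lsn without a gap.

    records: dict mapping lsn -> prev_lsn (the assumed per-segment
             back-link) for every record this segment has received,
             in any arrival order.
    """
    # Invert the back-links so we can walk the chain forward.
    next_of = {prev: lsn for lsn, prev in records.items()}
    scl = start_lsn
    # Follow the chain until the next record has not arrived yet.
    while scl in next_of:
        scl = next_of[scl]
    return scl


# Record 30 has not arrived, so the chain stops at 20 even though 40 is present.
received = {10: 0, 20: 10, 40: 30}
print(segment_complete_lsn(received))  # 20

# Once gossip fills the hole, the segment is complete through 40.
received[30] = 20
print(segment_complete_lsn(received))  # 40
```

The point of the back-link is that LSNs within one segment are not dense, so a segment cannot detect a hole just by looking for missing integers.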
System consistency is then maintained by tracking contiguous LSNs acknowledged at several granularities. The segment-complete LSN (SCL) determines consistency at the segment level; the protection group-complete LSN (PGCL) determines consistency at the protection group level, which is the quorum group level; and the volume-complete LSN (VCL) determines consistency across protection groups and for the database instance. To commit a transaction, a database instance waits until the volume is complete through that transaction's writes, that is, until VCL >= the transaction's system commit number (SCN).
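The three levels can be sketched roughly as follows. This is a simplified model under stated assumptions: a 4-of-6 write quorum per protection group, a volume log given as an ordered list of `(lsn, pg_id)` pairs, and made-up LSN values. The function names `pgcl`, `vcl`, and `committable` are hypothetical.

```python
WRITE_QUORUM = 4  # assumption: 4-of-6 write quorum per protection group


def pgcl(scls, write_quorum=WRITE_QUORUM):
    """Largest LSN that at least `write_quorum` segments are complete through."""
    return sorted(scls, reverse=True)[write_quorum - 1]


def vcl(volume_log, pgcls):
    """Highest LSN such that every earlier record has met quorum in its PG.

    volume_log: list of (lsn, pg_id) in ascending LSN order.
    pgcls:      dict mapping pg_id -> PGCL for that protection group.
    """
    complete = 0
    for lsn, pg in volume_log:
        if lsn > pgcls[pg]:
            break  # first record not yet durable at quorum: stop here
        complete = lsn
    return complete


def committable(pending_scns, current_vcl):
    """Transactions whose commit record is covered by the VCL can be acked."""
    return [scn for scn in pending_scns if scn <= current_vcl]


# Illustrative values: PG 1 holds odd LSNs, PG 2 holds even LSNs.
pgcls = {
    1: pgcl([105, 103, 103, 103, 101, 103]),  # 103
    2: pgcl([106, 104, 104, 104, 104, 102]),  # 104
}
log = [(101, 1), (102, 2), (103, 1), (104, 2), (105, 1), (106, 2)]
current_vcl = vcl(log, pgcls)
print(pgcls, current_vcl)                         # {1: 103, 2: 104} 104
print(committable([102, 104, 105], current_vcl))  # [102, 104]
```

Note that nothing here blocks on a vote: each level is just a running maximum derived from asynchronous acknowledgements, and a commit is acked whenever the VCL happens to pass its SCN.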
The group commit approach seems very nice because it decouples the system making progress and acknowledging writes from the moment a transaction commits. I think this means transactions add minimal overhead, assuming that data writes, rather than transactions, are the main unit of useful work.
When taking Lindsey Kuper’s class in the Fall, I seem to have derived something similar to what is mentioned in section 4.2 about quorum sets of unlike members. I think this is an awesome idea, and it can probably be carefully planned so that quorum overlap usually hits the fast nodes, and thus provides the information that sections 3.1 and 3.2 describe separately.
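As a sanity check on quorum sets of unlike members, the sketch below brute-forces the overlap property for one such rule set. It assumes, from my reading of section 4.2, three full segments and three tail segments, with a write quorum of "any 4 of 6 OR all 3 full segments" and a read quorum of "any 3 of 6 AND at least 1 full segment". The segment names and helper functions are hypothetical.

```python
from itertools import combinations

FULL = {"F1", "F2", "F3"}   # assumed: segments holding log records and data
TAIL = {"T1", "T2", "T3"}   # assumed: segments holding log records only
ALL = FULL | TAIL


def is_write_quorum(members):
    # Any 4 of 6 segments, OR all 3 full segments.
    return len(members) >= 4 or FULL <= members


def is_read_quorum(members):
    # Any 3 of 6 segments, AND at least 1 full segment.
    return len(members) >= 3 and bool(members & FULL)


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def quorums_overlap():
    """True if every read/write and write/write quorum pair shares a member."""
    writes = [s for s in subsets(ALL) if is_write_quorum(s)]
    reads = [s for s in subsets(ALL) if is_read_quorum(s)]
    read_write = all(r & w for r in reads for w in writes)
    write_write = all(a & b for a in writes for b in writes)
    return read_write and write_write


print(quorums_overlap())  # True
```

The "at least 1 full segment" clause is what keeps the scheme safe: without it, the three tail segments would form a read quorum that misses a write acknowledged only by the three full segments.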
The mixed use of logical and physical logging is very cool. I tend to like mixing ideas like these in order to get the benefits of both. Logical logging is very lightweight but requires that the record still be processed, whereas physical logging is more complex but much faster to apply. In this paper, the intended scale makes the performance benefit of replicating physical log records far outweigh the extra work that must go into undo records.
There is little clarification on the protection group policy. It seems like each protection group must be a separate group of machines, but if so, that makes “6 replicas” sound like far fewer machines than are really involved. For large deployments of Aurora, how many protection groups are needed for varying performance requirements? Does each storage node in a protection group ack back to a primary storage node or to the primary instance (as labeled in figure 2)?
What is a protection group? If segments A1 to F1 are replicas of each other, and odd segments are acknowledged at PG1 and even segments are acknowledged at PG2, are PGs separate storage nodes? There are only 6 replicas of each storage node, and there’s only 1 primary instance, so I am not clear what a PG is.
Actually, if I understand protection groups correctly, they are a set of storage nodes, where each storage node stores a segment and peer storage nodes store a replica of the segment. Then, having 2 protection groups means that there are 12 storage nodes, split across the 2 protection groups, sharing load and improving throughput. This seems powerful, but I wonder if it also enables fast striping for large blobs or provides “just enough” availability.