Reading Summary 2019-04-15

· by Aldrin Montana

Impossibility of Distributed Consensus with One Faulty Process

What problem does the paper solve? Is it important?

The paper addresses an aspect of distributed systems that many implementers fall into the trap of attempting–consensus in a distributed system. Unfortunately, the solution is that there is no solution. In any distributed system where a process may fail (partial failure), consensus is impossible regardless of the algorithm.

This solution is very important. While many practictioners may informally give up the search for a correct distributed consensus algorithm, this solution shows that no one should even make an attempt. Other than saving many developers a lot of time, it shows a limit to one aspect of distributed systems.

How does it solve the problem?

This paper takes a theoretical approach–which is necessary to prove the impossibility theorem they set out to prove. Specifically, they define a model and then show that in this model there are three lemmas which can be used to prove that distributed consensus is impossible: (1) If two sequences of operations are disjoint, then they can be applied in any order to obtain the same end state; (2) The initial state of any distributed system can agree on any one of all possible decision values; (3) Given possible states and a particular event, the event can always be applied or observed at the end of a sequence of operations and from a bivalent state (any 2 of 2 possible values can be decided on) the next state after any such event would also be bivalent. The summary of the main theorem which uses the above 3 lemmas is: for a run of a consensus algorithm to be correct, we need to reach a state from which a single possible value must be decided by every node; but there is always an event that can occur that maintains every resulting state as an ambiguous state (any decision is possible).

What alternate solutions exist? Are they adequately discussed?

No alternate solutions appear to be discussed in the paper. Many of the references are from the same authors (or at least Nancy Lynch), but there appear to be at least 2 references that provide the same model and investigate the exact same lemmas and similar theorems: (1) Asynchronous Byzantine Consensus and (2) Resilient Consensus Protocols.

It appears that this particular work was re-published in the Journal of the ACM, and the above references built on this work when initially published in SIGACT/SIGMOD to try and address some of the implications of the impossibility theorem.

How does this work relate to other research?

This work is very related to any distributed systems work in one of two ways: either why a distributed system does not attempt to rely on a consensus protocol for a core decision, or why a distributed system uses heuristics to achieve consensus.

What specific research questions does the paper raise for you?

I think this paper itself doesn’t raise questions for me, but indirectly raises the question of: what the hell do I do when I need a distributed system to make a decision globally? I suppose the best answer is that I need to either determine a clear bound and heuristically do my best to stick to it, or I avoid global decisions at all cost. If a correct, distributed consensus is impossible, then there is no reason for important decisions to be made by multiple nodes. Then, anything that does not require every node to agree on a decision can be distributed.