Reading Summary 2019-04-03

This is a repository for my research, paper reading summaries/reviews, and relevant blog-like posts in markdown.

Reading Summary 2019-04-03

ยท by Aldrin Montana

Time, Clocks, and the Ordering of Events in a Distributed System

What problem does the paper solve? Is it important?

The paper addresses the problem of event ordering in a distributed system. Primarily, this is done by answering the question of whether an event A happened before an event B, which allows us to order pairs of events. This is a very important question for applications that provide concurrent access to shared, mutable state by distributed processes.

How does it solve the problem?

The question of which events happen before other events is solved by the definition of a particular relation of interest, happens before. This relation captures 3 properties: (1) all events in a process are in order of happens before from earliest event to latest event, (2) for a given communication message, the sending process happens before the receiving process, (3) happens before is transitive.

What alternate solutions exist? Are they adequately discussed?

An alternative approach, discussed in the paper, is the use of physical/real clocks because physical time is global to all devices/agents. Although, physical clocks that adequately capture real time are incredibly difficult to implement and the smallest skews can throw off synchronization between nodes/agents. I think as far as when this paper was published, there were almost no other known alternatives.

How does this work relate to other research?

In general, this work is towards understanding dependencies between multiple processes. In the distributed systems case, processes are on different machines and dependencies (if A happens before B, then B depends on A) may indicate points where failures may cascade or halt the system. I think this relation can be extended to single-node parallelism where dependencies may indicate where a task needs to be restarted after a failure, or branches of a computation that can be scheduled in parallel, or other scheduling/resilience scenarios. It may also be relevant for recovery of data management systems, where determining dependencies is necessary for knowing what other options may need to be rolled back (assuming steals and logical commits instead of physical commits).

What specific research questions does the paper raise for you?

Having read this paper before, I still think about the trade-offs of using logical clocks in lamport style, compared to logical vector clocks. When is the dense representation of vector clocks favorable compared to lamport clocks given the memory overhead and processing overhead for elastic systems or large-scale distributed systems.