Reading Summary 2019.05.22
· by Aldrin Montana
Noria: dynamic, partially-stateful data-flow for high-performance web applications
Noria is a system that manages data-flow programs that have SQL-like semantics and accommodate partial results. These properties are core to Noria and imply other properties that distinguish Noria from other data-flow systems. Because the data-flow graph is acyclic, it does not support fixedpoint semantics like Naiad does. I think this is due to the SQL-like semantics, which do not express recursion. Partial results allow for many interesting properties, such as: the ability to dynamically add new internal and external views (without taking the data-flow program offline) and the ability to evict stale values for space (garbage collection) or compute results on incomplete data.
The ability to compute on incomplete data is especially interesting because it means that if upstream (ancestor, or source) views or tables are distributed across a network, then there is an opportunity to provide results as they come in, thus providing eventual consistency guarantees instead of strong consistency guarantees. For the type of applications that Noria targets, this is functionally acceptable and preferable for performance. This also enables the possibility of efficiently backfilling data to recompute over, assuming associated downstream records have not been evicted or garbage collected.
I really like the approach and the above properties provided and I think there are several directions for future related research.
The use of partial state to allow: garbage collection, value eviction, and eventual consistency seems really great to me.
Although I think the comparison to Naiad is a bit misleading, I like that some comparison was made and I think that the similar performance up to 6 nodes is a good sanity test of the Noria implementation.
I feel that some of the understatements in the paper are detrimental to the reader. Specifically, I think the point that Noria exposes a SQL interface understates the types of workloads that it targets compared to Naiad. I think many experienced application developers will understand whether a SQL interface fits their requirements, but I suspect that the evaluation compared to Naiad does not seem honest enough about what the trade-offs are between the two systems.
It seemed to me that upqueries are a solution to the presence of incomplete data, but I don’t see them as being much different from the cache invalidation scheme in the sense that they will reduce performance on the read path. Further, one of the mentioned benefits of Noria is that it doesn’t have to restart the data-flow program to add new views. I would like to see evaluation that explores these 2 points, even if briefly.
I feel like not having an implementation of multi-column joins is significant unless writes are de-normalized. I didn’t see much discussion of this and the impact for web applications. I think an extra paragraph or so would be nice.