Reading Summary 2019-05-14
· by Aldrin Montana
Too Big to Eat: Boosting Analytics Data Ingestion from Object Stores with Scoop
The authors are trying to address the performance issues of analytics frameworks that ingest data into an object store and then access the data for compute. The separation of these 2 phases and the bottleneck of having to read data out of disaggregated storage before running analysis is shown to be inefficient.
There are many related projects that push data filtering down to an active storage layer, such as Cybertron, Rhea, Diamond, and Minimal. There are also a few related active storage systems. However, none of these other approaches addressed the pushdown of data filters in a flexible manner that is as close to the backend object store as scoop, and they also did not support data locality.
The authors propose that data ingest should have ETL-type actions applied and that analysis should be able to offload some queries to an active object storage layer–essentially a layer that is able to run some compute close to the storage instead of waiting for the data to be communicated over the wire.
The flexibility of Scoop is significantly better than these other related projects. A primary distinguishing aspect mentioned by the authors is that scoop extends the object store (swift) with a platform that can apply pushed down data filters onto object requests.
For queries with >= 80% selectivity, scoop shows an order of magnitude speedup of spark SQL, and nearly 2x performance for queries with ~20% selectivity. For these particular experiments (Figure 5), there are no cases where there is a drop in performance. For the smart energy grid domain (e.g. GridPocket), scoop showed 6-18x speedup for a 50GB dataset, and 30-37x speedup for a 500GB dataset. Scoop also outperforms swift (by a lot) and parquet formats. In addition, scoop appears to have better CPU utilization than swift, but because it is so much faster it is able to release resources for every experiment far earlier than spark or swift.
This work provides compelling evidence that flexible pushdown of predicates and accommodations of data locality were still not adequately explored, even as recently as April 2017. I think that this work is also a good example of taking classic ideas (databases) and applying them in another field (storage systems) that has only recently seen hardware improvements that are driving different trade-offs than the field has seen before.
Comments and Questions
I am surprised that using swift as a backend was able to provide this type of speedup for spark SQL, and so it makes me wonder how much of the speed up is due to differences between swift and HDFS (which I assumed their evaluation of spark-only used).
I thought it was interesting that scoop shows greater speedup over swift than over parquet, which makes me remember that parquet was published/open-sourced when spark was still relatively new and thus was designed with some older assumptions of storage device characteristics. So, I wonder if parquet and HDFS simply have relatively good alignment of design and so that is why that the speedup of scoop over parquet is significantly less than scoop over swift.
It’s interesting to propagate predicate parameters via metadata in HTTP requests. It highlights the flexibility of HTTP as a channel between compute and storage nodes that allows for more expressive hints compared to trying to expand a low-level API. With this type of approach, it may even be possible to use sidecars (proxies) to do more complex transforms and optimizations of predicate parameters in HTTP requests without ever needing significant changes to the API itself.