Reading Summary 2019-05-28
· by Aldrin Montana
A Case for Packing and Indexing in Cloud File Systems
The authors want to make writes to cloud storage more efficient and cost-effective, and they discuss batching and coalescing small files into packed files as an approach.
The authors propose an intermediate layer, consisting of a master node and worker nodes, that receives, packs, and indexes writes before persisting them in cloud storage. A read is served by a worker node if the requested file has not yet been packed; once the file is packed, the read is redirected to the relevant blob extents in cloud storage.
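The write and read paths described above can be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not the authors' implementation: it uses a single worker and no master node, keeps the index inside the worker, and replaces cloud storage with an in-memory stand-in. All names (FakeCloudStore, PackingWorker, Extent, pack_threshold) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """Location of one file inside a packed blob (hypothetical index entry)."""
    blob_id: str
    offset: int
    length: int


@dataclass
class FakeCloudStore:
    """In-memory stand-in for cloud object storage; counts PUT requests."""
    blobs: dict = field(default_factory=dict)
    put_requests: int = 0

    def put(self, blob_id, data):
        self.blobs[blob_id] = data
        self.put_requests += 1

    def get_range(self, blob_id, offset, length):
        return self.blobs[blob_id][offset:offset + length]


class PackingWorker:
    """Buffers small writes, then packs them into one blob and indexes them."""

    def __init__(self, store, pack_threshold):
        self.store = store
        self.pack_threshold = pack_threshold  # bytes buffered before packing
        self.buffer = {}    # path -> bytes, not yet packed
        self.index = {}     # path -> Extent, already packed
        self.next_blob = 0

    def write(self, path, data):
        # An overwrite makes any packed copy stale, so drop its index entry.
        self.index.pop(path, None)
        self.buffer[path] = data
        if sum(len(d) for d in self.buffer.values()) >= self.pack_threshold:
            self.flush()

    def flush(self):
        """Coalesce all buffered files into a single blob with one PUT."""
        if not self.buffer:
            return
        blob_id = f"blob-{self.next_blob}"
        self.next_blob += 1
        packed = bytearray()
        for path, data in self.buffer.items():
            self.index[path] = Extent(blob_id, len(packed), len(data))
            packed += data
        self.store.put(blob_id, bytes(packed))
        self.buffer.clear()

    def read(self, path):
        # Unpacked files are served from the worker's buffer.
        if path in self.buffer:
            return self.buffer[path]
        # Packed files are redirected to an extent of a blob in the store.
        ext = self.index[path]
        return self.store.get_range(ext.blob_id, ext.offset, ext.length)
```

In this toy setup, writing 100 files of 64 bytes each with a 1024-byte threshold issues 7 PUT requests instead of 100. That shows where the reduction in write requests comes from, though the numbers are made up and say nothing about the paper's measured results.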
The reported results are roughly two orders of magnitude better cost efficiency and two orders of magnitude fewer write requests.
The main contribution seems to be the idea that established storage-system techniques still help performance even when cloud storage serves as the substrate, or persistent data store.
Comments and Questions
I’m a bit unconvinced by the generality of the evaluation. A figure early on shows a curve of packing and indexing efficiency across file sizes, and the evaluation then chooses the size that benefits the most. I suppose this is useful for showing maximal gains, but it doesn’t seem particularly realistic.
I feel that this feature is pretty cool for hot, unstructured data (or at least data that doesn’t go into a database). But, considering that Alluxio probably maintains the lineage approach of Tachyon for files, I am curious how this feature is exercised in a realistic Alluxio system.
I get the feeling that Alluxio benefits most from this feature in multi-tenant deployments, but then I feel like locality could really suffer. I’m curious whether Alluxio handles multi-tenancy at all, or whether this is a per-tenant feature. If it is per-tenant, I wonder whether there are approaches for maximizing temporal or spatial locality, and whether there is anything for transforming between the two packed versions.