Reading Summary 2019-05-14
· by Aldrin Montana
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Problem Statement and Related Work
Amazon had been seeing large data growth from many customers, but some customers already had data sizes far larger than what most analytics frameworks were capable of processing. The main problem, even after providing solutions such as redshift, is that Amazon saw an impossible trade-offs for these customers with huge datasets: Use powerful, simple analytics frameworks (map reduce) on scalable storage (S3), or powerful, complex analytics frameworks (redshift) on fast storage (local disks).
Amazon’s solution to the “tyranny of OR” problem (absolutely stupid name, by the way) was to create a new layer, redshift spectrum, that is capable of running redshift queries against S3 object storage. This is a huge benefit over previous approaches because it provides powerful analytics for complex queries on scalable object storage, which can also be queried using map reduce.
In this way, specialized local disks are no longer necessary, and S3 can be used for cheaper storage and unified storage that doesn’t suffer from the need for extra ETL.
This work was important for simplifying many customer queries over data, and probably makes it easier for Amazon to unify storage and processing in a way that makes their work much more efficient. I suppose that this also provides a significant motivation for transitions to object stores so that data can be “flexibly” migrated/replicated in S3 or internal storage systems such as Ceph or Swift.
Comments and Questions
I wonder if it was necessary to change redshift’s query plans from an interpreted language to an intermediate language which redshift spectrum then optimizes to an object store based language.
I wonder if Amazon uses query compilation instead of query interpretation in order to squeeze out as much performance as possible over workloads that potentially access petabytes of data (although I assume that while this much data is stored, that much data is almost never analyzed at once).
I wonder if the storage model of S3 is any different from the “high-performance local disks” used by redshift (does Amazon use S3 as a backend object store with a POSIX filesystem interface), and, relatedly, I wonder if the use of redshift spectrum on S3 allows for more data access path optimizations compared to redshift over a local filesystem.