Reading Summary 2019-04-30
· by Aldrin Montana
WiscKey - Separating Keys from Values in SSD-Conscious Storage
The authors are trying to reduce the write amplification that LSM-trees have so that they are better suited to flash storage (SSDs).
What other approaches or solutions existed at the time that this work was done? What was wrong with the other approaches or solutions?
There are many similar approaches for SSDs that use the LSM-tree approach: FAWN, FlashStore, SkimpyStash, BufferHash, and SILT. Of these, FAWN uses an append-only log on the SSD and uses an in-memory hash table. FlashStore and SkimpyStash improve on the in-memory hash table using cuckoohashing and compact key hashes, and storing part of the hash table on the SSD using linear chaining, respectively.
Other LSM-tree based approaches try to improve the scheduler, compaction, and parallelism (via internal flash channels) of LSM-trees. These are bLSM, VT-tree, and LOCS, respectively. Atlas seems to approximate WiscKey by storing keys and values on different drives, but it is an approach based on ARM processors and erasure coding.
There are many other related works but they begin to differ significantly in approach.
The authors approach focuses on the key insight that compaction of pages in an LSM-tree can be much more efficient if (large) values are stored off-page in a log. This allows LSM pages (or SSTables in LevelDB) to fit more keys because the values stored with the keys are just locations where the valuese are actually written. Then, the write amplification of appending to LSM pages and then running compaction is limited to something along the lines of a 16 bit key and a 1KB value, which is significantly less than a 16 bit key and upwards of several MBs (or larger) values.
Overall performance of WiscKey is very good compared to LevelDB, as it can achieve up to 4x better throughput for sequential load, and LevelDB has terrible random-load performance. All of the workloads are 100-GB dataset with the value size varying from 64B to 256KB. Most importantly, WiscKey greatly reduces write amplification of random workloads by at least 4x. Even for range query performance, WiscKey performs much better than LevelDB for values larger than 4KB. For values less than 4KB, WiscKey has comparable performance, within 50 MB/s throughput.
One important characteristic of this work is the utility of a seemingly small change (co-locate keys and values or separate values from keys using a level of indirection) can have for tuning a data structure to better match the workloads or better utilize hardware.
In addition, this paper provides some insights into the properties of flash storage that others may want to target.
Comments and Questions
I wonder if other approaches to garbage collecting deleted values were explored. The described method requires walking the log in a specific range and querying the LSM-tree to see if the values are still valid. I think this very much is inspired by compaction which is cool, but I wonder if there could simply be something that runs in the background immediately following compaction that goes through the vLog to remove values corresponding to keys that were removed during compaction.
I think the evaluation would be even more compelling if they included a version of LevelDB that separates values into a vLog and tried to keep everything else as much the same as possible. Then it would be interesting to see if structures such as bloom filters still provide a performance benefit and see how the read amplification is reduced. This would bridge the gap between a simpler system that separates and key values and a complicated system that tries to address read/write performance but does not try to address read/write amplification.
The authors mention the tradeoff of read, write, and space amplification and how it’s impossible to reduce all 3. The space amplification of WiscKey is minimal for large value sizes, but is up to 1.3 for small value sizes (64B). This makes me think of Albis, where space amplification is not prioritized because the desire is to achieve high performance.