· by Aldrin Montana
In order to meet industry needs, data de-duplication has become popular in disk-based systems to defer the need to archive to tape. In this paper, the authors describe 3 techniques used in the Data Domain File System to address disks as a bottleneck in systems that do data de-duplication. The techniques detailed are: (1) Summary Vector, (2) Stream-informed Segment Layout, and (3) Locality Preserved Caching.
The earliest data deduplication systems implemented file-level hashing to detect duplicate files. This approach clearly has limitations in the amount of data that can be de-duplicated. Venti is a previous project that re-used secure hashes for de-duplicating fixed size data blocks, but due to a lack of locality, the throughput was very low.
Now, modern storage systems implement de-duplication for variable-sized data blocks which move and change less frequently. Much of the previous work in this area has used fingerprints for determining whether content is duplicated and has been useful for network applications and distributed storage systems. Finally, the summary data structure for the proxy cache in previous work by Li Fan, et al. was inspiration for the authors of this paper to use bloom filters to implement the summary vector.
The primary problem with all previous work is that throughput was never prioritized, unlike in this paper.
The proposed solution is to add an extra few layers of indirection to improve throughput of the de-duplication mechanisms. Specifically, the content store can focus on how to split data into segments, the segment store can focus on how to filter and pack segments, and the content manager can focus on an underlying log storage mechanism that provides a good, low-level feature set such as verification, end-to-end validation, and others.
This work seemed to have contributed a great amount to improving the state-of-the-art in data de-duplication systems backed by hard disks. The key contributions being the focus on throughput optimization and the layers of abstraction that enable the performance improvements.
I believe naive implementations of hash maps or dictionaries tend to suffer from the same problem that the authors mention hashing of segments has: low cache hit-rates due to poor locality. I wonder if the locality preserving cache has been re-used at higher levels of abstraction–for languages such as python–in order to mitigate the mismatched performance characteristics of hash maps at the logical or language level compared to at the hardware level.