What is the problem the authors are trying to solve?
The authors address the fault tolerance and performance of RAID disk arrays, which had previously used a centralized architecture in which a single RAID controller was a single point of failure.
What other approaches or solutions existed at the time that this work was done? What was wrong with the other approaches or solutions?
The previous approach was not scalable because a single RAID controller served as the only interface to the storage disks. Additionally, if the controller failed, all of the disks in the array became unavailable.
The alternative distributed approach did not provide functionality that spans multiple disks, however. While access to each disk may have had lower latency and been more robust to a controller failure, array functions (striping, mirroring) were not provided.
In database systems there were two alternative approaches that are distinct from the approach in this paper: distributing access over networked storage nodes, so that if a node or site fails another node or site is available; and an approach similar to the alternative distributed storage approach, where multiple disks could be used but without a parallel RAID interface.
What is the authors’ approach or solution?
The authors suggest that a parallel RAID implementation within a single node will provide improved performance and reliability (fault tolerance) without having to consider a network of multiple machines. Namely, they detail and demonstrate a parallel RAID controller that implements a variety of features such as partial-write orderings, distributed parity calculation, and others.
Why is it better than the other approaches or solutions?
This approach is better because it not only provides parallel access to each disk but also provides RAID functionality such as striping and mirroring, which the other parallel or distributed solutions did not provide. This approach also distributes the load of parity calculation, which RAID-II could perform only in a central controller (or originator node).
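To make the parity point concrete, here is a minimal sketch of why parity calculation can be spread across nodes at all. It relies only on the fact that RAID parity is a bytewise XOR, which is associative and commutative, so partial parities computed separately can be combined into the same result a central controller would get. The function names, the two-node split, and the tiny 8-byte blocks are my own assumptions for illustration; this is not the paper's actual protocol or code.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """XOR a non-empty list of equal-length blocks into one parity block."""
    return reduce(xor_blocks, blocks)


def centralized_parity(stripe):
    """A single controller XORs every data block in the stripe itself."""
    return parity(stripe)


def distributed_parity(blocks_by_node):
    """Each (hypothetical) node XORs only its local blocks; the partial
    results are then XORed together to form the stripe parity."""
    partials = [parity(local) for local in blocks_by_node if local]
    return parity(partials)


# Four illustrative 8-byte data blocks, split across two assumed nodes.
stripe = [bytes([value] * 8) for value in (0x11, 0x22, 0x44, 0x88)]
blocks_by_node = [stripe[:2], stripe[2:]]

p_central = centralized_parity(stripe)
p_distributed = distributed_parity(blocks_by_node)

# Rebuild a lost data block from the survivors plus the parity block.
lost = 2
survivors = [block for i, block in enumerate(stripe) if i != lost]
rebuilt = parity(survivors + [p_distributed])

print(p_central == p_distributed)  # both routes give the same parity
print(rebuilt == stripe[lost])     # the lost block is recovered
```

Both routes produce the same parity block, so the XOR work can be divided among nodes without changing the result; what a real controller has to add on top of this is the coordination and failure handling.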
How does it perform?
For request sizes under 300 KB, tickerTAIP performs better than the centralized RAID controller at lower CPU speeds; for larger request sizes, its throughput and response times are vastly better than those of the centralized controller.
Why is this work important?
This work is important because it appears to be the first distributed RAID controller architecture. RAID matters for non-consumer use cases, especially where data speed or redundancy for common operations is important. This work not only provides better performance; its fault tolerance also means that faults are less likely to halt all operations.
The performance results include a non-fault-tolerant striping configuration of tickerTAIP, but the authors provide no information on how often faults occurred.
When a fault occurs in a tickerTAIP RAID, I wonder how the fault is reported to admins so that the node or disks can be replaced. For that matter, even though the RAID is fault-tolerant, does that in most scenarios just mean that it stays functional long enough to bring the system down and replace nodes or disks? Or was hot-swapping an existing technology at the time?
If multiple host interfaces are attached to a particular originator node, I wonder whether there is still an upper limit of roughly 3 streams to the disks that bounds the scalability of the tickerTAIP architecture.