Performance, Reliability, and Performability Aspects of Hierarchical RAID

Alexander Thomasian, Yujie Tang

doi:10.1109/nas.2011.45

Abstract

We consider a two-level Hierarchical RAID, HRAIDk/ℓ, with MDS erasure coding, so that it tolerates k node failures and ℓ disk failures per node with the minimum redundancy level. With no controller failures, HRAIDk/ℓ tolerates any (k + 1)(ℓ + 1) - 1 disk failures, while the maximum number of disk failures tolerated in an array with N nodes and M disks per node is N × ℓ + (M - ℓ)k. We vary RAID controller failure rates with respect to the disk failure rate in a simulation study to determine the Mean Time to Data Loss (MTTDL) and HRAID performability, defined as the number of I/Os processed until the system fails while disks are processing requests at their maximum bandwidth. Rebuild via restriping is used to handle both disk and node failures until check strips are exhausted. HRAIDk/ℓ provides the higher MTTDL and performability for k < ℓ when controllers are more reliable than disks, and vice versa, so that the level of redundancy at the two levels should be balanced. The distribution of the number of disk failures to data loss is affected by N and M. The mean and the 95th percentile of this number may be used to determine the minimum controller failure rate with respect to disk failure rates in a design study to equalize the probability of data loss due to disk and controller failures. We investigate the cost of fault tolerance by comparing HRAIDk/ℓ performance using maximum I/O rates (IOPS) and mean response times as metrics. System performance is at its worst after controller and disk failures, when the system is operating in degraded mode, but surprisingly the maximum IOPS increases after restriping, which is due to the reduction in the small write penalty as check strips are overwritten. Finally, we undertake a design study to determine the number of disk failures leading to one or more node failures, which is used to determine the maximum controller failure rates.
We investigate system performance (maximum IOPS and read response times) in degraded mode and with varying levels of redundancy. There is a strong correlation between MTTDL and performability, and k < ℓ yields the lower performability for a highly reliable failure rate.
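The two disk-failure bounds quoted in the abstract can be checked with a few lines of Python. The following is a minimal sketch, not the authors' simulator: the function names are hypothetical, controller failures are ignored, and the data-loss rule assumed here is that a node is lost once more than ℓ of its disks fail and the array is lost once more than k nodes are lost.

```python
def guaranteed_tolerated(k, l):
    """Number of disk failures that is always survivable: (k + 1)(l + 1) - 1."""
    return (k + 1) * (l + 1) - 1


def max_tolerated(n_nodes, m_disks, k, l):
    """Largest survivable number of disk failures: N * l + (M - l) * k."""
    return n_nodes * l + (m_disks - l) * k


def data_lost(failed_per_node, k, l):
    """Assumed loss rule: more than k nodes each have more than l failed disks."""
    lost_nodes = sum(1 for failed in failed_per_node if failed > l)
    return lost_nodes > k


if __name__ == "__main__":
    # Hypothetical configuration: N = 4 nodes, M = 4 disks per node, HRAID1/1.
    print(guaranteed_tolerated(1, 1))       # any 3 disk failures are survivable
    print(max_tolerated(4, 4, 1, 1))        # best case: 7 disk failures
    print(data_lost([4, 1, 1, 1], 1, 1))    # best-case pattern with 7 failures
    print(data_lost([2, 2, 0, 0], 1, 1))    # worst-case pattern with 4 failures
```

In the hypothetical 4 × 4 HRAID1/1 example, the best case loses one whole node (4 disks) plus one disk in each of the other three nodes, while the worst case concentrates (k + 1)(ℓ + 1) = 4 failures in two nodes and causes data loss.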
