Reparo: A Fast RAID Recovery Scheme for Ultra-large SSDs
A recent ultra-large SSD (e.g., a 32-TB SSD) provides many benefits in building cost-efficient enterprise storage systems. Owing to its large capacity, however, when such SSDs fail in a RAID storage system, a long rebuild overhead is inevitable for RAID reconstruction that requires a huge amount of data copies among SSDs. Motivated by modern SSD failure characteristics, we propose a new recovery scheme, called reparo , for a RAID storage system with ultra-large SSDs. Unlike existing RAID recovery schemes, reparo repairs a failed SSD at the NAND die granularity without replacing it with a new SSD, thus avoiding most of the inter-SSD data copies during a RAID recovery step. When a NAND die of an SSD fails, reparo exploits a multi-core processor of the SSD controller in identifying failed LBAs from the failed NAND die and recovering data from the failed LBAs. Furthermore, reparo ensures no negative post-recovery impact on the performance and lifetime of the repaired SSD. Experimental results using 32-TB enterprise SSDs show that reparo can recover from a NAND die failure about 57 times faster than the existing rebuild method while little degradation on the SSD performance and lifetime is observed after recovery.