Erasure coding is an effective technique for guaranteeing data reliability in storage systems, yet it incurs a high repair penalty because of amplified repair traffic. Repair becomes more intricate in clustered storage systems, where bandwidth is diverse. We present TPRepair, a Tree-based Pipelined Repair approach that aims to expedite the overall repair process with a tailored pipelined repair procedure. TPRepair first prioritizes racks with the current minimum load when selecting participants for the repair. It then constructs tree-based links designed to fit the pipelined repair procedure. TPRepair further provides an optimization algorithm that reduces the bottleneck load when repairing multiple chunks. Large-scale simulations demonstrate that TPRepair can increase the balance ratio by 13.8% to 41.3% without amplifying cross-rack traffic. Meanwhile, experiments on Alibaba Cloud ECS indicate that TPRepair can increase repair throughput by 11.3% to 72.9%.
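The rack-selection step can be illustrated with a minimal sketch. It shows only the general idea of preferring the least-loaded racks; it is not TPRepair's actual algorithm, which the abstract does not specify. The names `pick_min_load_racks` and `assign_repair`, the load units, and the fixed `traffic_per_rack` increment are all assumptions made for illustration.

```python
import heapq


def pick_min_load_racks(rack_load, num_racks):
    """Return the ids of the `num_racks` racks with the smallest current load.

    `rack_load` maps a rack id to its current load (hypothetical units).
    Ties are broken by rack id so the result is deterministic.
    """
    if num_racks > len(rack_load):
        raise ValueError("not enough racks to choose from")
    chosen = heapq.nsmallest(
        num_racks, rack_load.items(), key=lambda kv: (kv[1], kv[0])
    )
    return [rack for rack, _ in chosen]


def assign_repair(rack_load, num_racks, traffic_per_rack=1):
    """Pick the least-loaded racks for one repair and charge them its traffic.

    Assumption: each participating rack takes on the same amount of traffic.
    Mutates `rack_load` in place and returns the chosen rack ids.
    """
    chosen = pick_min_load_racks(rack_load, num_racks)
    for rack in chosen:
        rack_load[rack] += traffic_per_rack
    return chosen


if __name__ == "__main__":
    loads = {"r0": 3, "r1": 1, "r2": 2, "r3": 1}
    print(assign_repair(loads, 2))  # first repair
    print(assign_repair(loads, 2))  # second repair sees the updated loads
    print(loads)
```

Because the loads are updated after every assignment, consecutive repairs tend to spread over different racks, which is the intuition behind reducing the bottleneck load when several chunks are repaired.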