By integrating cyber, physical, and social spaces, cyber-physical-social systems (CPSS) bring greater convenience to people. For practical applications and user convenience, the Big Data produced in CPSS must be stored in the distributed storage systems of CPSS. In this paper, we study fault tolerance for the distributed storage systems of CPSS and propose a framework that can recover multiple failed nodes simultaneously. With regard to the reliability of storage nodes in distributed storage systems, research on locally repairable codes has mostly focused on repairing failed nodes within each repair group. However, when entire repair groups fail, existing locally repairable codes cannot repair more than one failed group. We therefore propose local codes with cooperative repair that can recover more than one failed group. Specifically, the proposed local codes are constructed from minimum storage regenerating (MSR) codes and are interleaved with one another, so that the parity symbols of any local code can be generated from the MSR codes in its two adjacent local codes. This property allows more than one failed local group to be repaired cooperatively by the adjacent local groups, with lower repair locality. We also derive the key parameters of local codes with cooperative repair. Theoretical analysis and simulation results show that, compared with previous codes with local regeneration, our codes incur higher bandwidth overhead when repairing failed nodes but offer advantages in storage overhead and repair locality, whether repairing a single failed node or one failed local group. Moreover, for a single failed local group, local codes with cooperative repair achieve almost the same tradeoff curve between storage overhead and bandwidth overhead as MSR-local codes and minimum bandwidth regenerating local (MBR-local) codes.