Optimal Copyset in Distributed Object Storage

Yaoguang Huo, Hui Li, Xin Yang, Han Wang, Junfeng Ma, Xiangzhen Meng

doi:10.1109/bigdata52589.2021.9671908

Abstract

Distributed storage systems usually rely on replication to ensure system reliability and data availability. Random replication is widely used in cloud storage systems to prevent data loss. Copyset Replication (CR) is a replication strategy that makes a nearly optimal trade-off between the number of scattered nodes and the probability of data loss. Compared with random replication, CR greatly reduces the probability of data loss caused by node failure. However, because CR selects a copyset at random, it cannot easily pick the best copyset for data with particular characteristics, such as computation-oriented or storage-oriented data. To address this limitation, this paper proposes Optimal Copyset Replication (OCR), which selects the optimal copyset according to the specified data characteristics and the conditions of the corresponding nodes. Combined with Cyberspace Mimicry Defense (CMD), we implemented OCR in a distributed object storage system and conducted experiments. When the amount of computation-type data reaches 300,000, the experimental results show that, compared with CR's random copyset selection, OCR reduces data processing time by nearly 10% by selecting the optimal copyset. By setting the relevant parameters, OCR can also keep the data distribution across nodes relatively uniform and avoid data skew.
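The difference between random and characteristic-aware copyset selection can be illustrated with a minimal sketch. It is not the paper's implementation: the permutation-based copyset construction follows the general idea of Copyset Replication, while the function names, the `node_score` mapping, and the "highest summed score wins" rule are assumptions made purely for illustration.

```python
import random
from typing import Dict, List, Tuple

Copyset = Tuple[str, ...]


def build_copysets(nodes: List[str], replicas: int,
                   permutations: int, seed: int = 0) -> List[Copyset]:
    """Build copysets by cutting shuffled node permutations into groups.

    Each permutation is split into consecutive groups of `replicas` nodes;
    a trailing partial group is dropped.
    """
    rng = random.Random(seed)
    copysets = set()
    for _ in range(permutations):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled) - replicas + 1, replicas):
            copysets.add(tuple(sorted(shuffled[i:i + replicas])))
    return sorted(copysets)


def choose_random(copysets: List[Copyset], rng: random.Random) -> Copyset:
    """CR-style selection: any copyset, chosen uniformly at random."""
    return rng.choice(copysets)


def choose_scored(copysets: List[Copyset],
                  node_score: Dict[str, float]) -> Copyset:
    """Hypothetical OCR-style selection: the copyset whose nodes have the
    highest total score for the data characteristic of interest
    (for example, spare compute capacity for computation-type data)."""
    return max(copysets, key=lambda cs: sum(node_score[n] for n in cs))
```

With nine nodes, three replicas, and two permutations, `build_copysets` yields between three and six copysets. `choose_random` ignores node conditions, whereas `choose_scored` prefers the group of nodes best suited to the data. A fuller version would also penalize heavily loaded nodes so that data stays evenly distributed, which this sketch does not attempt.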

Full Text