The exponential growth of data in many science and engineering domains poses significant challenges to storage systems. Data distribution is a critical component of large-scale distributed storage systems and plays a vital role in placing petabytes of data or more across tens to hundreds of thousands of storage devices. Meanwhile, heterogeneous storage systems, such as those combining hard disk drives (HDDs) and storage class memories (SCMs), have become increasingly popular for massive data storage because of their distinct and complementary characteristics. This paper presents a new data distribution algorithm called SUORA (Scalable and Uniform storage via Optimally-adaptive and Random number Addressing), designed specifically for heterogeneous devices to maximize their benefits. SUORA provides a fully symmetric, highly efficient methodology for distributing data across a hybrid, tiered storage cluster. It divides heterogeneous devices into buckets and segments and uses pseudo-random functions to map data onto them, balancing capacity, performance, and lifetime. By analyzing data hotness and access patterns, SUORA gradually moves hot data from HDDs to SCMs to optimize throughput and moves cold data in the reverse direction for load balance. It combines data replication with migration to significantly reduce movement overhead while making data placement more adaptive to different workloads. Extensive evaluations in simulation and on the Sheepdog storage system show that, by thoroughly considering the distinct characteristics of various devices, SUORA improves the overall performance efficiency of heterogeneous storage systems.
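To illustrate the general idea of mapping data onto weighted segments with a pseudo-random function, the following is a minimal sketch and not SUORA's actual algorithm. It assumes each device receives one segment of the unit interval proportional to a single weight (standing in for a combined capacity, performance, and lifetime score), and that a SHA-256 hash of the object key serves as the pseudo-random number. The device names and the functions `build_segments`, `pseudo_random`, and `place` are hypothetical.

```python
import bisect
import hashlib
from itertools import accumulate


def build_segments(weights):
    """Split [0, 1) into one segment per device, sized by relative weight.

    Returns the device names and the cumulative upper bound of each segment.
    """
    names = list(weights)
    total = sum(weights.values())
    bounds = list(accumulate(weights[name] / total for name in names))
    bounds[-1] = 1.0  # guard against floating-point rounding in the last bound
    return names, bounds


def pseudo_random(key):
    """Deterministically map a key to a number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def place(key, names, bounds):
    """Return the device whose segment contains the key's pseudo-random number."""
    return names[bisect.bisect_right(bounds, pseudo_random(key))]


# Hypothetical cluster: two HDDs and one smaller SCM device.
weights = {"hdd-0": 4, "hdd-1": 4, "scm-0": 1}
names, bounds = build_segments(weights)
print(place("object-42", names, bounds))
```

Because placement depends only on the key and the segment table, any client can compute an object's location without consulting a central directory. The sketch leaves out hotness tracking, replication, and migration.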