Abstract

Cloud storage services are often associated with various performance issues due to load imbalance, interference from background tasks such as data scrubbing, backfilling, and recovery, and differences in the processing capabilities of heterogeneous servers in a datacenter. This has a significant impact on a broad range of applications characterized by massive working sets and real-time constraints. However, it is challenging and burdensome for human operators to hand-tune the various control-knobs in a cloud-scale storage cluster to maintain optimal performance under diverse workload conditions. Our study of Ceph, an open-source object-based storage system, shows that common load balancing strategies are ineffective unless they are adapted to workload characteristics. Furthermore, the positive effects of an applied strategy may not be immediately visible. To address these challenges, we developed a machine learning-based system adaptation technique that enables a cloud storage system to manage itself through load balancing and data migration, with the aim of delivering optimal performance in the face of diverse workload patterns and resource bottlenecks. In particular, we applied a stochastic policy-gradient-based reinforcement learning technique to track performance hotspots in the storage cluster and to take appropriate corrective actions that maximize future performance under a variety of complex scenarios. For this purpose, we leveraged system-level performance monitoring and the control-knobs commonly available in object-based cloud storage systems. We implemented the developed techniques to build an Adaptive Resource Management (ARM) system for object-based storage clusters, and evaluated its performance on NSF Cloud's Chameleon testbed. Experiments using the Cloud Object Storage Benchmark (COSBench) show that ARM improves the average read and write response times of the Ceph storage cluster by up to 50% and 33%, respectively, compared to the default configuration. It also outperforms a state-of-the-art dynamic load rebalancing technique, improving read and write performance of Ceph storage by 43% and 36%, respectively.
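To make the policy-gradient idea concrete, the following is a minimal, illustrative sketch of a REINFORCE-style agent that could select a corrective action (e.g., adjusting the weight of an overloaded storage device) from observed load metrics. The state features, action set, reward signal, and all names here are assumptions chosen for illustration; they are not the paper's actual ARM implementation or Ceph's API.

```python
import numpy as np

class SoftmaxPolicy:
    """Linear softmax policy trained with REINFORCE (stochastic policy gradient).

    Assumed setup (illustrative only):
      - state: vector of per-device load metrics (e.g., latency, queue depth)
      - actions: discrete rebalancing choices (e.g., lower/keep/raise a device weight)
      - reward: e.g., negative average response time after the action takes effect
    """

    def __init__(self, n_features, n_actions, lr=0.01):
        self.theta = np.zeros((n_features, n_actions))  # policy parameters
        self.lr = lr

    def probs(self, state):
        logits = state @ self.theta
        logits -= logits.max()            # numerical stability
        e = np.exp(logits)
        return e / e.sum()

    def act(self, state):
        # Sample an action from the current stochastic policy.
        p = self.probs(state)
        return np.random.choice(len(p), p=p)

    def update(self, episode, gamma=0.99):
        # REINFORCE update: theta += lr * G_t * grad log pi(a_t | s_t),
        # where G_t is the discounted return from step t onward.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            p = self.probs(state)
            grad_log = -np.outer(state, p)   # gradient of log-softmax ...
            grad_log[:, action] += state     # ... for the chosen action
            self.theta += self.lr * G * grad_log
```

In practice, delayed rewards matter here: because the benefit of a rebalancing or migration action is not immediately visible, the return G accumulates rewards observed over a window after the action, rather than only the instantaneous measurement.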
