Host managed contention avoidance storage solutions for Big Data

Pratik Mishra, Arun K Somani

doi:10.1186/s40537-017-0080-9

Abstract

The performance gap between compute and storage is considerable, which creates a mismatch between what applications need from storage and what storage can deliver. The full potential of storage devices cannot be harnessed until all layers of the I/O hierarchy function efficiently. Despite advanced optimizations applied across the various layers along the data access path, the I/O stack remains volatile. The problems caused by inefficiencies in data management are amplified in Big Data shared resource environments. The Linux OS (host) block layer is the most critical part of the I/O hierarchy, as it orchestrates the I/O requests from different applications to the underlying storage. Unfortunately, despite its significance, the block layer, essentially the block I/O scheduler, has not evolved to meet the needs of Big Data. We have designed and developed two contention avoidance storage solutions, collectively known as “BID: Bulk I/O Dispatch”, in the Linux block layer, specifically to suit multi-tenant, multi-tasking shared Big Data environments. Hard disk drives (HDDs) form the backbone of data center storage. The data access time in HDDs is largely governed by disk arm movements, which usually occur when data is not accessed sequentially. Big Data applications exhibit evident sequentiality, but because of contention with other I/O submitting applications, their I/O accesses get multiplexed, which leads to more disk arm movements. BID schemes aim to exploit the inherent I/O sequentiality of Big Data applications to improve the overall I/O completion time by reducing avoidable disk arm movements. In the first part, we propose a dynamically adaptable block I/O scheduling scheme, BID-HDD, for disk based storage. BID-HDD tries to recreate the sequentiality in I/O access in order to provide performance isolation to each I/O submitting process.
Through trace-driven, simulation-based experiments with cloud-emulating MapReduce benchmarks, we show the effectiveness of BID-HDD, which takes 28–52% less time to complete all I/O requests than the best performing Linux disk schedulers. In the second part, we propose a hybrid scheme, BID-Hybrid, which exploits the superior random performance of SCM (SSDs) to further avoid contention at disk based storage. BID-Hybrid efficiently offloads non-bulky interruptions from the HDD request queue to the SSD queue, using BID-HDD for disk request processing and a multi-q FIFO architecture for the SSD. This results in a performance gain of 6–23% for MapReduce workloads compared to BID-HDD and 33–54% over the best performing Linux scheduling scheme. The BID schemes as a whole aim to avoid contention for disk based storage I/Os, following system constraints without compromising SLAs.
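The effect of multiplexing on disk arm movement can be illustrated with a minimal Python sketch. This is not the authors' simulator: the block addresses, the round-robin interleaving, and the use of block distance as a stand-in for seek cost are all assumptions made for illustration (real seek time is not linear in distance).

```python
from itertools import zip_longest


def total_seek_distance(blocks, start=0):
    """Sum of absolute head movements (in blocks) when servicing `blocks` in order."""
    distance, head = 0, start
    for block in blocks:
        distance += abs(block - head)
        head = block
    return distance


def interleave(*streams):
    """Round-robin multiplexing of per-process request streams (an assumed model)."""
    merged = []
    for group in zip_longest(*streams):
        merged.extend(b for b in group if b is not None)
    return merged


# Two hypothetical processes, each reading four consecutive blocks.
proc_a = [100, 101, 102, 103]
proc_b = [5000, 5001, 5002, 5003]

multiplexed = interleave(proc_a, proc_b)  # requests alternate between processes
grouped = proc_a + proc_b                 # each process dispatched in bulk

print("multiplexed:", total_seek_distance(multiplexed))
print("grouped:    ", total_seek_distance(grouped))
```

In this toy model, dispatching each process's requests together keeps the head moving in one direction, whereas the interleaved order forces a long seek on almost every request.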

Highlights

Data Centers today cater to a wide variety of applications, with workloads varying from data science batch and streaming applications to decoding genome sequences.
Through trace-driven, simulation-based experiments with cloud-emulating MapReduce benchmarks, we show the effectiveness of BID-HDD, which achieves a 28–52% I/O time performance gain for all I/O requests over the best performing Linux disk schedulers.
The “Issues with current I/O schedulers” section describes the working of the current state-of-the-art Linux disk schedulers deployed in shared Big Data infrastructure.

Summary

Introduction

Data Centers today cater to a wide variety of applications, with workloads varying from data science batch and streaming applications to decoding genome sequences. With the aid of server and storage virtualization, multiple processes contend for the same physical resources (namely, compute, network and storage) [2]. We have designed and developed two contention avoidance storage solutions in the Linux block layer, collectively known as “BID: Bulk I/O Dispatch”, to suit multi-tenant, multi-tasking Big Data shared resource environments. The main goal of BID-Hybrid is to further enhance the performance of the BID-HDD scheduling scheme by offloading interruption-causing non-bulky I/Os to the SSD, thereby keeping the HDD “request queue” available for bulky and sequential I/Os. In contrast to the existing tiering literature, where data is tiered based on the deviation of adjacent disk block locations in the device “request queue”, BID-Hybrid profiles the I/O characteristics (bulkiness) of each process to decide on the correct candidates for tiering. We conclude the paper in the “Conclusion and future works” section with a discussion of future work.
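The idea of tiering by per-process bulkiness, rather than by block-address deviation, can be sketched as follows. This is a hypothetical illustration, not the BID-Hybrid implementation: the request tuple layout, the `route_requests` helper, and the 1 MiB threshold are assumptions introduced here, and the text above does not specify how bulkiness is measured.

```python
from collections import defaultdict

# Hypothetical cut-off for calling a process "bulky"; not taken from the paper.
BULKY_THRESHOLD_BYTES = 1 << 20


def route_requests(requests, threshold=BULKY_THRESHOLD_BYTES):
    """Split requests into (hdd_queue, ssd_queue) by per-process bulkiness.

    `requests` is a list of (pid, block, size_bytes) tuples seen in one
    profiling window. A process is treated as bulky when the total bytes
    it submitted in the window reach `threshold`.
    """
    bytes_per_pid = defaultdict(int)
    for pid, _block, size in requests:
        bytes_per_pid[pid] += size

    hdd_queue, ssd_queue = [], []
    for req in requests:
        pid = req[0]
        if bytes_per_pid[pid] >= threshold:
            hdd_queue.append(req)   # bulky, likely sequential: keep on disk
        else:
            ssd_queue.append(req)   # small interruption: offload to SSD
    return hdd_queue, ssd_queue


# Process 1 streams 2 MiB in 512 KiB requests; process 2 interrupts with 4 KiB.
requests = [
    (1, 1000, 512 * 1024),
    (1, 2024, 512 * 1024),
    (2, 90000, 4096),
    (1, 3048, 512 * 1024),
    (1, 4072, 512 * 1024),
]
hdd_queue, ssd_queue = route_requests(requests)
print("HDD:", hdd_queue)
print("SSD:", ssd_queue)
```

Here the single small request from process 2 is diverted to the SSD queue, so the HDD queue holds only the requests of process 1 in their original order.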

Background

[Figure labels: "C1", "SQC", "Queues for other processes"]

Findings

Conclusion and future works