Abstract

Hadoop has become the de facto big data processing framework. Following its success on large data sets, companies and organizations increasingly consolidate their analysis workloads onto a shared Hadoop stack across development teams for cost efficiency. As a result, Hadoop now serves not only its original batch computations but also low-latency online queries. However, the Hard Disk Drives (HDDs) used in Hadoop's storage system perform poorly under random requests and disk contention, so the homogeneous HDD storage layer in HDFS inevitably suffers from this divergence in I/O access patterns. A promising direction is to build hybrid, heterogeneous storage with Solid State Drives (SSDs), which can deliver very high I/O rates at acceptable cost. However, previous work mostly focuses on separate phases of the data flow and therefore generalizes poorly across application scenarios. In this paper, we present a novel heterogeneous architecture that separates I/O into sequential and random access patterns. We mitigate the access-pattern divergence through a heterogeneous storage system in which HDDs serve sequential I/O requests while SSDs provide low-latency random file access. We evaluate an actual implementation on a medium-sized cluster consisting of HDDs and varying numbers of SSDs, using workloads from a leading search engine company. Experiments show that our system improves disk utilization by 17% and reduces job duration by 12% on average compared with the original system.
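To make the placement idea concrete, the sketch below is not the paper's implementation; it only illustrates how one might steer randomly accessed, latency-sensitive data onto SSD volumes and keep sequentially scanned batch data on HDD using the storage policies built into stock HDFS (Hadoop 2.6+). The paths and policy choices are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Minimal sketch: route random-access data to SSD and sequential batch
// data to HDD via standard HDFS heterogeneous-storage policies.
public class HybridPlacementSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("Expected an HDFS file system");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Hypothetical path for low-latency, randomly accessed data
        // (e.g., online query indexes): pin its blocks to SSD volumes.
        dfs.setStoragePolicy(new Path("/warehouse/online-index"), "ALL_SSD");

        // Hypothetical path for large, sequentially scanned batch inputs:
        // keep it on HDD ("HOT" places all replicas on DISK storage).
        dfs.setStoragePolicy(new Path("/warehouse/batch-logs"), "HOT");
    }
}
```

The same policies can also be set from the command line with `hdfs storagepolicies -setStoragePolicy`; the paper's system instead distinguishes sequential and random access patterns itself rather than relying on per-directory policies.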
