MOHA: Many-task computing meets the big data platform

Jik-Soo Kim, Soonwook Hwang, Cao Ngoc Nguyen

doi:10.1109/escience.2016.7870900

Abstract

Many-Task Computing (MTC) is a new computing paradigm that aims to bridge the gap between traditional High-Throughput Computing (HTC) and High-Performance Computing (HPC). MTC applications from scientific domains such as pharmaceuticals, astronomy, and physics often consist of a very large number (from thousands to even billions) of data-intensive tasks (tens of MB of I/O per second) with relatively short per-task execution times (from seconds to minutes). Each task in an MTC application may require a relatively small amount of data processing, especially compared to existing Big Data applications, which are typically based on larger data block sizes (e.g., the default block size in Hadoop is 64MB). However, MTC applications can consist of much larger numbers of tasks, where each task communicates through files rather than through message passing interfaces such as MPI in HPC applications. MTC can therefore be regarded as another type of data-intensive workload, in which a large number of data processing tasks must be processed efficiently within a relatively short period of time. In this paper, we present the design and implementation of MOHA (Many-task computing On HAdoop), which enables an effective convergence of MTC technologies and the existing Big Data platform Hadoop. MOHA is developed as a Hadoop YARN application so that it can transparently co-host existing MTC applications with other Big Data processing frameworks such as MapReduce in a single Hadoop cluster. Our evaluation results, based on a microbenchmark, show that MOHA can substantially reduce the overall execution time of many-task processing with a minimal amount of resources compared to an existing Hadoop YARN application. In addition, MOHA can efficiently dispatch a large number of tasks, which can be crucial for supporting challenging MTC applications.
MOHA raises many interesting research issues related to data grouping and declustering on the Hadoop Distributed File System (HDFS), scalable job and metadata management, and dynamic task load balancing, which can ultimately contribute to a new data processing framework in the YARN-based Hadoop 2.0 ecosystem.

Full Text