MapReduce-like Frameworks Research Articles

Big data processing applications have gradually migrated to the cloud because of the advantages of cloud computing. The Hadoop Distributed File System (HDFS) is one of the fundamental support systems for big data processing on MapReduce-like frameworks such as Hadoop and Spark. Because HDFS is not aware of the co-location of virtual machines in the cloud, its default block allocation scheme fits cloud environments poorly in two respects: loss of data reliability and performance degradation. In this paper, we present a novel location-aware data block allocation strategy (LDBAS). LDBAS jointly optimizes data reliability and performance for upper-layer applications by allocating data blocks according to the locations and the differing processing capacities of virtual nodes in the cloud. We apply LDBAS to two stages of data allocation in HDFS in the cloud (the initial data allocation and data recovery) and design the corresponding algorithms. Finally, we implement LDBAS in an actual Hadoop cluster and evaluate its performance with the BigDataBench benchmark suite. The experimental results show that LDBAS guarantees the designed data reliability while reducing the job execution time of I/O-intensive applications in Hadoop by 8.9% on average, and by up to 11.2%, compared with the original Hadoop in the cloud.
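The idea of placing blocks by location and capacity can be illustrated with a minimal sketch. This is not the LDBAS algorithm from the paper, whose details the abstract does not give; it only assumes that each virtual node reports the physical host it runs on and a single capacity score. The function name `place_replicas` and the `(vm, host, capacity)` tuple layout are hypothetical.

```python
def place_replicas(nodes, replication=3):
    """Pick target VMs for one block's replicas.

    nodes: list of (vm_name, physical_host, capacity) tuples (hypothetical layout).
    Returns up to `replication` VM names, at most one per physical host,
    preferring higher-capacity VMs.
    """
    # Keep only the highest-capacity VM on each physical host, so that
    # no two replicas share a host (a host failure then loses one replica).
    best_per_host = {}
    for vm, host, capacity in nodes:
        if host not in best_per_host or capacity > best_per_host[host][1]:
            best_per_host[host] = (vm, capacity)

    if len(best_per_host) < replication:
        raise ValueError("not enough distinct physical hosts for the replication factor")

    # Rank the per-host candidates by capacity (ties broken by name).
    ranked = sorted(best_per_host.values(), key=lambda vc: (-vc[1], vc[0]))
    return [vm for vm, _ in ranked[:replication]]


nodes = [
    ("vm1", "hostA", 4),
    ("vm2", "hostA", 8),  # co-located with vm1
    ("vm3", "hostB", 2),
    ("vm4", "hostC", 6),
    ("vm5", "hostD", 1),
]
chosen = place_replicas(nodes)
print(chosen)
```

A location-unaware scheme could place two replicas on `vm1` and `vm2`, which share `hostA`; the sketch avoids that by construction.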

MapReduce-like frameworks have achieved tremendous success for large-scale data processing in data centers. A key feature distinguishing MapReduce from previous parallel models is that it interleaves parallel and sequential computation. Past schemes for general parallel models, and especially their theoretical bounds, are therefore unlikely to apply to MapReduce directly. Many recent studies address MapReduce job and task scheduling, but they assume that servers are assigned in advance. In current data centers, multiple MapReduce jobs of different importance levels run together. In this paper, we investigate a scheduling problem for MapReduce that also takes server assignment into consideration. We formulate a MapReduce server-job organizer problem (MSJO) and show that it is NP-complete. We develop a 3-approximation algorithm and a fast heuristic. We further propose a novel fine-grained practical algorithm for the general MapReduce-like task scheduling problem. Finally, we evaluate our algorithms through both simulations and experiments on Amazon EC2 with an implementation on Hadoop. The results confirm the superiority of our algorithms.
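To make the setting concrete, here is a toy sketch of joint job ordering and server assignment. It is not the paper's 3-approximation or its heuristic; it swaps in a plain greedy rule (highest weight per unit of work first, placed on the earliest-free server) and assumes, for simplicity, that each job has one map phase followed by one reduce phase on a single server. The job tuple layout and the weighted-completion-time objective are assumptions made for illustration.

```python
import heapq


def schedule(jobs, num_servers):
    """Greedy toy scheduler.

    jobs: list of (name, weight, map_time, reduce_time) tuples (hypothetical layout).
    Returns ({name: (server, finish_time)}, total weighted completion time).
    """
    # Order jobs by weight per unit of total work, highest first.
    order = sorted(jobs, key=lambda j: -(j[1] / (j[2] + j[3])))

    # Min-heap of (time the server becomes free, server id).
    servers = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(servers)

    assignment = {}
    total = 0.0
    for name, weight, map_t, reduce_t in order:
        free_at, server = heapq.heappop(servers)
        map_done = free_at + map_t      # reduce cannot start before map ends
        finish = map_done + reduce_t
        assignment[name] = (server, finish)
        total += weight * finish
        heapq.heappush(servers, (finish, server))
    return assignment, total


jobs = [("a", 3, 1, 1), ("b", 1, 2, 2), ("c", 2, 1, 3)]
assignment, total = schedule(jobs, num_servers=2)
print(assignment, total)
```

The sequential map-then-reduce dependency inside each job, combined with parallelism across servers, is the interleaving the abstract refers to; the weights stand in for the jobs' importance levels.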

Articles published on MapReduce-like Frameworks

LDBAS: Location-aware Data Block Allocation Strategy for HDFS-based Applications in the Cloud

Joint scheduling of MapReduce jobs with servers: Performance bounds and experiments

Moving Big Data to The Cloud: An Online Cost-Minimizing Approach
