Abstract

We propose a new optimal data placement technique that improves the performance of MapReduce in cloud data centers by considering not only data locality but also global data access costs. We first conduct an analytical and experimental study to identify the performance issues of MapReduce in data centers, showing that MapReduce tasks involved in unexpected remote data access incur much higher communication costs and longer execution times, and can significantly degrade overall performance. Next, we formulate the optimal data placement problem, propose a generative model to minimize the global data access cost in data centers, and show that the problem is NP-hard. To solve it, we propose a topology-aware heuristic algorithm that first constructs a replica-balanced distribution tree for the abstract tree structure and then builds a replica-similarity distribution tree for the detailed tree construction, together yielding an optimal replica distribution tree. The experimental results demonstrate that our data placement approach improves the performance of MapReduce with lower communication and computation costs by effectively minimizing global data access costs and, more specifically, by reducing unexpected remote data access.
