Abstract
MapReduce is a commonly used framework for parallel processing of data-intensive tasks, but its performance usually suffers from the heavy communication load incurred by shuffling intermediate values (IVs) among computing servers. The recently proposed Coded MapReduce framework uses a coding scheme named coded distributed computing (CDC) to trade extra computation for a lower communication load in MapReduce. CDC achieves the optimal computation-communication tradeoff when all IVs have the same size. In many practical applications, however, IV sizes vary over a large range, and CDC then performs poorly. In this paper, we introduce a generalized CDC scheme that takes IV sizes into account, and we formulate a combinatorial optimization problem that minimizes the communication load for a fixed computation load. We show that the problem is NP-hard and propose an efficient algorithm that achieves an approximation ratio of 2. Experiments conducted on Alibaba Cloud show that, compared to the original CDC scheme, our IV-size-aware approach significantly reduces the communication load and achieves a lower total execution time.
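To illustrate why IV sizes matter for coded shuffling, the following is a minimal sketch, not the paper's algorithm. It assumes the standard XOR-based coding used in CDC, zero-padding of the shorter IV, and that each receiver already knows the length of the IV it wants. The helper names `xor_padded` and `decode` and the sample byte strings are hypothetical.

```python
def xor_padded(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (assumed padding rule)."""
    n = max(len(a), len(b))
    a = a.ljust(n, b"\x00")
    b = b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def decode(coded: bytes, known: bytes, wanted_len: int) -> bytes:
    """Recover the wanted IV from a coded packet, given the IV the receiver
    can compute locally. wanted_len is assumed to be known to the receiver."""
    return xor_padded(coded, known)[:wanted_len]


# Two illustrative IVs of very different sizes.
iv_a = b"count=7"
iv_b = b"count=1234567890"

# One multicast packet serves two receivers: one knows iv_b and wants iv_a,
# the other knows iv_a and wants iv_b.
coded = xor_padded(iv_a, iv_b)

uncoded_load = len(iv_a) + len(iv_b)  # send both IVs separately
coded_load = len(coded)               # equals the size of the larger IV

recovered_a = decode(coded, iv_b, len(iv_a))
recovered_b = decode(coded, iv_a, len(iv_b))
```

Under these assumptions the coded packet is as long as the larger of the two IVs, so pairing a small IV with a large one saves little, while pairing IVs of similar size saves close to half. This is the intuition behind choosing which IVs to code together based on their sizes.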