Learning automata-based algorithms for MapReduce data skewness handling

Mohammad Amin Irandoost, Amir Masoud Rahmani, Saeed Setayeshi

doi:10.1007/s11227-019-02855-0

Abstract

MapReduce is one of the most successful techniques for large-scale data processing. However, its performance drops significantly when the data are skewed. The hash function is the default partitioner in Big Data frameworks such as Hadoop and Spark. Hash partitioning works well when there is no data skewness, which is not the case for naturally occurring data. In this paper, we propose two new algorithms based on learning automata, namely the learning automata partitioner (LAP) and the traffic cost-aware partitioner (TCAP), for handling reducer-side data skewness in MapReduce applications. LAP is based on cluster combination and performs well when the degree of data skewness is low. TCAP, on the other hand, takes the network topology into account and balances the network traffic cost in the shuffling phase. TCAP also supports cluster splitting and performs well at any degree of data skewness. Both LAP and TCAP can be used in heterogeneous environments. We evaluated the performance of our algorithms through several experiments and simulations using well-known benchmarks. The results confirmed that our algorithms performed better than other similar algorithms in most cases.
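To make the reducer-side skew problem concrete, the following minimal Python sketch mimics hash partitioning outside any framework. It is an illustration only, not the LAP or TCAP algorithm: the function names, the key sets, and the use of CRC32 as a stand-in for a key's hash code are all assumptions made for this example. The idea is that every record with the same key goes to the same reducer, so a single very frequent key overloads one reducer no matter how well the hash spreads the remaining keys.

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer by hashing it (hypothetical helper).

    CRC32 is used only because it is deterministic across runs;
    real frameworks use the key's own hash code instead.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Count how many records each reducer receives."""
    counts = Counter(hash_partition(k, num_reducers) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


NUM_REDUCERS = 4

# Assumed toy inputs: 1000 records each.
# Uniform case: every key occurs exactly once.
uniform_keys = [f"key{i}" for i in range(1000)]
# Skewed case: one "hot" key accounts for 70% of the records.
skewed_keys = ["hot"] * 700 + [f"key{i}" for i in range(300)]

uniform_loads = reducer_loads(uniform_keys, NUM_REDUCERS)
skewed_loads = reducer_loads(skewed_keys, NUM_REDUCERS)

print("uniform loads:", uniform_loads)
print("skewed loads: ", skewed_loads)
```

In the skewed case, the reducer that receives the hot key holds at least 700 of the 1000 records, so the job's completion time is dominated by that one reducer. This is the imbalance that a skew-aware partitioner tries to avoid.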

Full Text