Abstract

Data is accumulating at an incredible rate, and the era of big data has arrived. Big data brings great challenges to traditional machine learning algorithms, it is difficult for learning tasks in big data scenario to be completed on stand-alone. Data reduction is an effective way to solve this problem. Data reduction includes attribute reduction and instance reduction. In this study, we focus on instance reduction also called instance selection, and view the instance selection as an optimal instance subset selection problem. Inspired by the ideas of cross validation and divide and conquer, we defined a novel criterion called combined information entropy with respect to a set of classifiers to measure the importance of an instance subset, the criterion uses multiple independent classifiers trained on different subsets to measure the optimality of an instance subset. Based on the criterion, we proposed an approach which uses genetic algorithm and open source framework to select optimal instance subset from big data. The proposed algorithm is implemented on two open source big data platforms Hadoop and Spark, the conducted experiments on four artificial data sets demonstrate the feasibility of the proposed algorithm and visualize the distribution of selected instances, and the conducted experiments on four real data sets compared with three closely related methods on test accuracy and compression ratio demonstrate the effectiveness of the proposed algorithm. Furthermore, the two implementations on Hadoop and Spark are also experimentally compared. The experimental results show that the proposed algorithm provides excellent performance and outperforms the three methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call