Optimal instance subset selection from big data using genetic algorithm and open source framework

Junhai Zhai, Dandan Song

doi:10.1186/s40537-022-00640-0

Abstract

Data is accumulating at an incredible rate, and the era of big data has arrived. Big data poses great challenges to traditional machine learning algorithms: in big data scenarios, learning tasks are difficult to complete on a stand-alone machine. Data reduction is an effective way to address this problem, and it includes attribute reduction and instance reduction. In this study, we focus on instance reduction, also called instance selection, and treat it as an optimal instance subset selection problem. Inspired by the ideas of cross-validation and divide and conquer, we define a novel criterion, called combined information entropy with respect to a set of classifiers, to measure the importance of an instance subset. The criterion uses multiple independent classifiers trained on different subsets to measure the optimality of an instance subset. Based on this criterion, we propose an approach that uses a genetic algorithm and an open source framework to select an optimal instance subset from big data. The proposed algorithm is implemented on two open source big data platforms, Hadoop and Spark. Experiments on four artificial data sets demonstrate the feasibility of the proposed algorithm and visualize the distribution of the selected instances. Experiments on four real data sets, which compare the algorithm with three closely related methods in terms of test accuracy and compression ratio, demonstrate its effectiveness. Furthermore, the two implementations on Hadoop and Spark are compared experimentally. The experimental results show that the proposed algorithm provides excellent performance and outperforms the three methods.
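The general idea of searching for an instance subset with a genetic algorithm can be sketched on a single machine as follows. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the abstract does not give the formula for the combined information entropy, so `subset_score` is a hypothetical stand-in that sums the Shannon entropy of each classifier's predicted class distribution over the selected instances and subtracts a size penalty. The toy logistic classifiers, the parameter names, and the one-point crossover with truncation selection are likewise assumptions, and the Hadoop and Spark parallelization is omitted.

```python
import math
import random


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def subset_score(subset, classifiers, size_penalty):
    """Hypothetical stand-in criterion (NOT the paper's definition):
    total entropy of all classifier outputs minus a penalty per instance."""
    total = sum(shannon_entropy(clf(x)) for x in subset for clf in classifiers)
    return total - size_penalty * len(subset)


def ga_select(data, classifiers, size_penalty=1.0, pop_size=20,
              generations=40, mutation_rate=0.05, seed=0):
    """Search for a high-scoring instance subset with a simple genetic algorithm.
    Each chromosome is a 0/1 mask over the instances in `data`."""
    rng = random.Random(seed)
    n = len(data)

    def fitness(mask):
        chosen = [x for x, bit in zip(data, mask) if bit]
        return subset_score(chosen, classifiers, size_penalty)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection, best half kept
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return [x for x, bit in zip(data, best) if bit], fitness(best)


def make_classifier(threshold):
    """Toy 1-D logistic 'classifier' returning [P(class 0), P(class 1)]."""
    def predict_proba(x):
        p = 1.0 / (1.0 + math.exp(-(x - threshold)))
        return [1.0 - p, p]
    return predict_proba


data = [i / 2 for i in range(-10, 11)]  # 21 points on [-5, 5]
classifiers = [make_classifier(t) for t in (-0.5, 0.0, 0.5)]
selected, score = ga_select(data, classifiers, size_penalty=1.5)
```

Under this stand-in score, instances on which the classifiers are uncertain contribute the most, so the search tends to favor points near the toy decision boundaries. A criterion matching the paper would replace `subset_score` while leaving the search loop unchanged.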


Journal: Journal of Big Data
Publication Date: Jul 5, 2022
Citations: 2
License type: open-access


Similar Papers

Bankruptcy Prediction Using an Improved Bagging Ensemble
Sung-Hwan Min
Journal of Intelligence and Information Systems | VOL. 20
30 Dec 2015

Legal Governance of Brain Data Derived from Artificial Intelligence
Mahika Ahluwalia
Voices in Bioethics | VOL. 7
02 Jun 2021

Active Learning With Optimal Instance Subset Selection
Yifan Fu ... Xingquan Zhu
IEEE Transactions on Cybernetics | VOL. 43
07 Mar 2013

Sequential circuit test generation in a genetic algorithm framework
Elizabeth M Rudnick ... Janak H Patel
01 Jan 1993
