Abstract

In data analysis, data scientists usually focus on the size of the data rather than on feature selection. Owing to the rapid growth of internet resources, data volumes are increasing exponentially and carry ever more features, which leads to the big-data dimensionality problem. High-dimensional data contain many redundant features, which can degrade classification accuracy. Feature selection has therefore attracted the research community as a way to identify and remove irrelevant features with greater scalability and accuracy. To address this, in this research study we present a novel feature selection framework implemented on the Hadoop and Apache Spark platforms. The proposed model combines rough sets with the differential evolution (DE) algorithm: rough sets are used to find a minimal feature subset, but because they do not account for the degree of overlap in the data, the DE algorithm is then applied to find the optimal features. The proposed model is evaluated with Random Forest and Naive Bayes classifiers on five well-known data sets and compared with existing feature selection models from the literature. The results show that the proposed model performs well in terms of both scalability and accuracy.
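As a rough illustration of the DE-based selection step described above, the following single-machine sketch (not the paper's distributed Hadoop/Spark implementation, and omitting the rough-set reduction stage) uses SciPy's differential_evolution to evolve a binary feature mask scored by cross-validated Naive Bayes accuracy. The data set, the sparsity-penalty weight, and all DE parameters are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of DE-driven feature selection on one machine,
# NOT the authors' Spark implementation. The data set, penalty
# weight, and DE settings below are assumptions for illustration.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(weights):
    # Threshold DE's continuous vector into a binary feature mask.
    mask = weights > 0.5
    if not mask.any():
        return 1.0  # penalize an empty feature subset
    acc = cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()
    # Minimize classification error plus a small sparsity penalty.
    return (1.0 - acc) + 0.01 * mask.mean()

result = differential_evolution(
    fitness,
    bounds=[(0.0, 1.0)] * n_features,
    maxiter=10,
    popsize=5,
    seed=42,
    polish=False,  # thresholded objective is piecewise constant
)
selected = np.flatnonzero(result.x > 0.5)
print(f"Selected {selected.size}/{n_features} features: {selected}")
```

A Naive Bayes scorer is used here purely because it is fast to refit inside the DE loop; the paper's Random Forest classifier could be substituted at higher computational cost.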
