A dataset that has massive features and imbalanced classes may be challenging for obtaining adequate accuracy in classification approaches of Machine Learning (ML). The purpose of this research is to find the optimal feature subset for cervical cancer diagnosis with efficient classification approach by estimating the performance of various Machine Learning predictive models. Filter-based feature selection techniques of Relief and Information Gain are applied in this study to calculate the rank for each feature that can be applied to order and select highest scoring features for feature selection. An optimal feature subset is generated with wrapper approach through Recursive Feature Elimination which uses a Random Forest procedure and Genetic Algorithm has been employed based on evolutionary principle. The predictive models are established with 10fold cross validation using prevalent classification algorithms like Random Forest, C5.0, K-Nearest Neighbour and Naïve Bayes. The results showed an enhancement in the average performance of these classifiers concurrently and the classification error for these classifiers decreases substantially. The experiments also exhibited that by employing this approach an optimal and reduced feature subset is desirable for the enrichment of classification accuracy with a lower computational cost. The features generated by fused approach of Relief and Genetic algorithm methods were able to predict the results in an efficient manner, hence an optimal feature subset has been nominated through this procedure. Maximum number of classifiers have shown good results in terms of performance outcomes. In addition, Random Forest method has shown advanced accuracy rate with an improved percentage of sensitivity and specificity results. Also, this work established that the best and optimal feature subset selection through Fused Feature Selection (FFS) approach could reduce the complexity of the predictive model.
Read full abstract