Abstract

Background: Feature selection is a crucial step in machine learning analysis. Currently, many feature selection approaches do not deliver satisfactory results, in terms of accuracy and computational time, when the amount of data is huge, such as in ‘Omics’ datasets.

Results: Here, we propose an innovative implementation of a genetic algorithm, called GARS, for fast and accurate identification of informative features in multi-class and high-dimensional datasets. In all simulations, GARS outperformed two standard filter-based, two ‘wrapper’, and one ‘embedded’ selection methods, showing high classification accuracies in a reasonable computational time.

Conclusions: GARS proved to be a suitable tool for performing feature selection on high-dimensional data. Therefore, GARS could be adopted when standard feature selection approaches do not provide satisfactory results or when there is a huge amount of data to be analyzed.

Highlights

  • Feature selection is a crucial step in machine learning analysis

  • We found that the features selected by GARS were robust: the error rate on the validation test sets was consistently low for GARS and was obtained with fewer selected features than with the other methods

  • While we do not presume to have covered here the full range of options for performing feature selection on high-dimensional data, we believe that our test suggests GARS as a powerful and convenient resource for the timely selection of an effective and robust set of informative features in high-dimensional data

Summary

Introduction

Feature selection is a crucial step in machine learning analysis. Currently, many feature selection approaches do not deliver satisfactory results, in terms of accuracy and computational time, when the amount of data is huge, such as in ‘Omics’ datasets. The feature selection (FS) step seeks to pinpoint the most informative variables from data to build robust classification models. This becomes crucial in the Omics data era, as the combination of high-dimensional data with information from various sources (clinical and environmental) enables researchers to study complex diseases such as cancer or cardiovascular disease in depth [1,2,3,4].

Chiesa et al. BMC Bioinformatics (2020) 21:54

[...] optimize a problem by iteratively improving the solution based on a given heuristic function, whereas hybrid methods are a sequential combination of different FS approaches, for example those based on filter and wrapper methods [9].

[...] To find the optimal solution, this scheme is repeated several times until the population has converged, i.e., new offspring are not significantly different from the previous generation.
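The repeat-until-convergence scheme mentioned above can be illustrated with a toy genetic algorithm. This is only a minimal sketch under stated assumptions, not the GARS implementation: the bit-mask encoding, tournament selection, one-point crossover, bit-flip mutation, the `toy_fitness` function, and every name and parameter value are hypothetical choices made for illustration. Convergence is approximated here by "the best fitness has not improved for `patience` generations", which is one simple stand-in for offspring no longer differing significantly from the previous generation.

```python
import random

# Hypothetical ground truth for this toy problem: only these features matter.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    """Assumed toy score: reward informative features, penalise subset size."""
    return 2 * sum(mask[i] for i in INFORMATIVE) - sum(mask)

def tournament(pop, fitness, rng):
    """Binary tournament: return the fitter of two randomly drawn masks."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(fitness, n_features, pop_size=30, max_gens=200,
           mutation_rate=0.02, patience=10, seed=0):
    """Evolve bit masks (1 = feature kept) until the best score stops improving."""
    rng = random.Random(seed)
    # Random initial population of candidate feature subsets.
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    stale = 0      # consecutive generations without improvement
    gens_run = 0
    for gens_run in range(1, max_gens + 1):
        offspring = [best[:]]  # elitism: the best mask survives unchanged
        while len(offspring) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        pop = offspring
        champion = max(pop, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion, 0
        else:
            stale += 1
        # Convergence proxy: new generations no longer improve on the old ones.
        if stale >= patience:
            break
    return best, gens_run

best_mask, generations = evolve(toy_fitness, n_features=12)
print("generations run:", generations)
print("selected features:", [i for i, bit in enumerate(best_mask) if bit])
```

In a real feature selection setting, `toy_fitness` would be replaced by a score computed from the data for the selected subset, and the stopping rule could compare whole populations rather than only the best individual.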

Methods
Results
Discussion
Conclusion
