Abstract

Background: Selecting a subset of relevant properties from a large set of features that describe a dataset is a challenging machine learning task. In biology, for instance, advances in the available technologies enable the generation of a very large number of biomarkers that describe the data. Choosing the most informative markers while also classifying the data with high accuracy can be a daunting task, particularly if the data are high dimensional. An often adopted approach is to formulate the feature selection problem as a biobjective optimization problem, with the aim of maximizing the performance of the data analysis model (the quality of the fit to the training data) while minimizing the number of features used.

Results: We propose an optimization approach for the feature selection problem that considers a “chaotic” version of the antlion optimizer method, a nature-inspired algorithm that mimics the hunting mechanism of antlions in nature. Balancing exploration of the search space against exploitation of the best solutions is a challenge in multi-objective optimization. In the antlion optimizer, the exploration/exploitation rate is controlled by the parameter I, which limits the random walk range of the ants/prey. This variable is increased iteratively in a quasi-linear manner to decrease the exploration rate as the optimization progresses. This quasi-linear schedule for I may lead to premature convergence in some cases and to trapping in local minima in others. The chaotic system proposed here attempts to improve the tradeoff between exploration and exploitation. The methodology is evaluated using different chaotic maps on a number of feature selection datasets. To ensure generality, we used ten biological datasets, but we also used other types of data from various sources. The results are compared with the particle swarm optimizer and with genetic algorithm variants for feature selection using a set of quality metrics.
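To make the role of the schedule concrete, the following is a minimal sketch of how a chaotic map could modulate a shrinking random-walk range. It is an illustration under stated assumptions, not the formula used in the paper: the logistic map is only one possible chaotic map, and the function names (`logistic_map`, `linear_fraction`, `chaotic_walk_fractions`), the seed `x0`, and the way the chaotic value scales the linear envelope are all hypothetical.

```python
def logistic_map(x0=0.7, r=4.0):
    """Yield an endless chaotic sequence in [0, 1] from the logistic map.

    x0 and r are illustrative choices, not values taken from the paper.
    """
    x = x0
    while True:
        x = r * x * (1.0 - x)
        yield x


def linear_fraction(t, max_iter):
    """Fraction of the full walk range left at iteration t (1-based).

    Assumption: a plain linear shrink stands in for the quasi-linear
    schedule described in the abstract.
    """
    return 1.0 - t / max_iter


def chaotic_walk_fractions(max_iter, x0=0.7):
    """Return one walk-range fraction per iteration.

    Each fraction is the linear envelope scaled by a chaotic value, so the
    range still shrinks overall but fluctuates from step to step.
    """
    chaos = logistic_map(x0)
    return [linear_fraction(t, max_iter) * next(chaos)
            for t in range(1, max_iter + 1)]


if __name__ == "__main__":
    for t, frac in enumerate(chaotic_walk_fractions(10), start=1):
        print(f"iteration {t:2d}: walk range fraction = {frac:.4f}")
```

In this sketch the envelope still reaches zero at the last iteration, while the chaotic factor lets some iterations search more widely than a purely monotone schedule would.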

Highlights

  • We use ten biological datasets to validate the performance of our method and its potential applicability for data generated in biology


Summary

Introduction

The large amounts of data generated today in biology offer more detailed and useful information, but they also make the process of analyzing these data more difficult because not all the information is relevant. Feature selection (attribute reduction) is a technique used in classification and regression problems to identify a subset of informative features and remove the redundant ones. This mechanism is useful when the number of attributes is large and not all of them are required for describing the data and for further exploring the data attributes in experiments. Using a tumor as a simple example, there are a large number of attributes that describe it: mitotic activity, tumor invasion, tumor shape and size, vascularization, and growth rate, to name just a few. All of these attributes require measurements and tests that are not always easy to perform. An often adopted approach is to formulate the feature selection problem as a biobjective optimization problem, with the aim of maximizing the performance of the data analysis model (the quality of the fit to the training data) while minimizing the number of features used.
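One common way to handle the two objectives is to fold them into a single weighted score. The sketch below illustrates that idea only. The text above does not say how the two objectives are combined, so the weighted sum, the weight `alpha`, the function name `feature_selection_fitness`, and the penalty for an empty subset are all assumptions made for this example.

```python
def feature_selection_fitness(mask, error_rate, alpha=0.99):
    """Score a candidate feature subset; lower is better.

    mask       -- list of 0/1 flags, one per feature (1 = feature selected)
    error_rate -- classification error in [0, 1] measured by some model
                  trained on the selected features (supplied by the caller)
    alpha      -- hypothetical weight trading error against subset size
    """
    n_total = len(mask)
    n_selected = sum(mask)
    if n_selected == 0:
        # Assumption: an empty subset cannot be evaluated, so it is
        # given the worst possible score.
        return float("inf")
    size_ratio = n_selected / n_total
    return alpha * error_rate + (1.0 - alpha) * size_ratio


if __name__ == "__main__":
    # Same error rate, fewer features: the smaller subset scores better.
    small = feature_selection_fitness([1, 0, 0, 0], error_rate=0.10)
    full = feature_selection_fitness([1, 1, 1, 1], error_rate=0.10)
    print(small < full)
```

With a weight close to 1, classification error dominates the score and the subset size mainly breaks ties between subsets of similar accuracy.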

