Feature Selection From Gene Expression Data Using Simulated Annealing and Partial Least Squares Regression Coefficients

Nimrita Koul, Sunilkumar S Manvi

doi:10.1016/j.gltp.2022.03.001

Abstract

Accurate characterization of the molecular nature of a tumour is important for its effective treatment, which makes tumour classification an important research problem. Applying data science and machine learning techniques to gene-expression data has enabled computational researchers to separate gene-expression samples into classes based on differences in their expression patterns. It has also facilitated the discovery of new classes and new disease biomarkers. However, gene-expression data is high-dimensional and noisy: the number of features is large compared with the number of samples, the classes are often imbalanced, and out of thousands of genes only a few are relevant to the disease. Machine learning approaches for classifying gene-expression samples need to address all of these issues to obtain reliable performance. This paper proposes a method that uses simulated annealing and partial least squares regression for gene selection, and applies it to six open-source microarray cancer gene-expression datasets. The selected subset of genes was used to fit support vector machine, random forest, voting, and multilayer perceptron classifiers. A comparison with existing methods shows the superior performance of the proposed method.
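
As a rough illustration of how simulated annealing and partial least squares (PLS) regression can be combined for gene selection, the following minimal sketch searches over fixed-size gene subsets and scores each subset by the in-sample R^2 of a one-component PLS fit. It is not the authors' implementation. The function names (pls1_fit, anneal_select), the scoring rule, the swap move, the cooling schedule, and the synthetic data are all assumptions made for this example, and an in-sample score like this one would overfit on real data, so a held-out or cross-validated score would be preferable in practice.

```python
import math
import numpy as np


def pls1_fit(X, y):
    """One-component PLS1 on centred data; returns (coefficients, in-sample R^2)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc                       # weight vector: direction of max covariance with y
    norm = np.linalg.norm(w)
    if norm == 0.0:                     # no covariance with y at all
        return np.zeros(X.shape[1]), 0.0
    w /= norm
    t = Xc @ w                          # latent scores
    q = (t @ yc) / (t @ t)              # regress y on the scores
    resid = yc - q * t
    r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return q * w, r2                    # with one component, coefficients are q * w


def anneal_select(X, y, k, n_iter=500, t0=0.1, cooling=0.99, seed=0):
    """Simulated annealing over subsets of k columns (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = rng.choice(p, size=k, replace=False)
    _, cur_score = pls1_fit(X[:, current], y)
    init_score = cur_score
    best, best_score = current.copy(), cur_score
    temp = t0
    for _ in range(n_iter):
        # Propose a neighbour: swap one selected gene for an unselected one.
        cand = current.copy()
        outside = np.setdiff1d(np.arange(p), current)
        cand[rng.integers(k)] = rng.choice(outside)
        _, cand_score = pls1_fit(X[:, cand], y)
        delta = cand_score - cur_score
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = current.copy(), cur_score
        temp *= cooling                 # geometric cooling (an assumed schedule)
    return np.sort(best), best_score, init_score


# Synthetic stand-in for expression data: 40 samples, 200 "genes",
# of which the first five are made informative by construction.
data_rng = np.random.default_rng(42)
n, p, k = 40, 200, 5
y = np.repeat([-1.0, 1.0], n // 2)
X = data_rng.normal(size=(n, p))
X[:, :5] += 1.5 * y[:, None]

genes, score, init = anneal_select(X, y, k)
print("selected columns:", genes, "score:", round(float(score), 3))
```

The selected columns could then be passed to any downstream classifier. Accepting some worse subsets early in the search, while the temperature is high, is what lets the procedure escape local optima that a purely greedy swap search would get stuck in.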

Full Text