Abstract

One of the central tasks of genome research is to predict phenotypes and discover important gene biomarkers. However, three main problems arise when analyzing genomic data for phenotype prediction and gene marker selection: large p and small n, low reproducibility of the selected biomarkers, and high noise. To provide a unified solution that alleviates these problems, we propose a self-paced learning $L_{1/2}$ absolute network-based logistic regression model, called SLNL. Through the $L_{1/2}$ regularization, the model obtains a sparser result, which provides better interpretability. The absolute network-based penalty enables the model to integrate feature-network knowledge and helps select genes with higher reproducibility. Moreover, this penalty overcomes the drawback of the traditional network penalty, which ignores the signs of the coefficients. Through the self-paced learning strategy, the model accounts for the noise level in gene expression data, reduces the impact of high-noise samples on model training, and achieves better prediction accuracy. We compare the proposed method with six alternative approaches in various experimental scenarios, including a comprehensive simulation, four benchmark gene expression datasets, one lung cancer dataset, and three lung cancer validation sets. Results show that SLNL identifies a smaller set of meaningful biomarkers and obtains the best or equivalent prediction performance. Moreover, biological analysis shows that the genes selected by SLNL might be helpful for tumor diagnosis and treatment.
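For orientation, a plausible form of the SLNL objective, assembled from the standard formulations of its stated components (weighted logistic loss with self-paced sample weights $v_i$, the $L_{1/2}$ penalty, and an absolute network-Laplacian penalty), is sketched below; the symbols $\lambda_1$, $\lambda_2$, $\gamma$, $w_{jk}$, and $d_j$ are illustrative assumptions rather than notation taken from the paper, and the exact objective may differ:

$$
\min_{\beta,\; v\in[0,1]^n}\ \sum_{i=1}^{n} v_i\,\ell\!\left(y_i, x_i^{\top}\beta\right)\;-\;\gamma\sum_{i=1}^{n} v_i\;+\;\lambda_1\sum_{j=1}^{p}|\beta_j|^{1/2}\;+\;\lambda_2\sum_{(j,k)\in E} w_{jk}\left(\frac{|\beta_j|}{\sqrt{d_j}}-\frac{|\beta_k|}{\sqrt{d_k}}\right)^{2}
$$

Here $\ell$ denotes the logistic loss, the self-paced weights $v_i$ down-weight high-noise samples during training, $E$ is the edge set of the gene network with edge weights $w_{jk}$ and node degrees $d_j$, and the absolute values $|\beta_j|$ are what make the network penalty insensitive to coefficient signs.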
