Abstract

One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that may be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods that have been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel feature selection algorithm based on the Lasso; our hypothesis is that a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to a theoretical analysis of our feature scoring scheme, we provide empirical evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method selected more meaningful features than those commonly used in the clinic. This case study provides a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code that implements it has been released as FeaLect, a documented R package on CRAN.
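For context, below is a minimal usage sketch of the released package. The example dataset name (mcl_sll) and the argument names are assumptions based on the package documentation and may differ between versions; consult ?FeaLect after installation for the authoritative interface.

    # Minimal usage sketch of the FeaLect package from CRAN. The dataset
    # name and argument names below are assumptions and may differ between
    # package versions; see ?FeaLect for the authoritative API.
    install.packages("FeaLect")
    library(FeaLect)

    data(mcl_sll)                  # assumed example data: lymphoma samples
    F <- as.matrix(mcl_sll[, -1])  # feature matrix, one row per sample
    L <- as.numeric(mcl_sll[, 1])  # binary class labels
    names(L) <- rownames(F)

    # Score all features by fitting Lasso models on many bootstrap samples.
    result <- FeaLect(F = F, L = L, maximum.features.num = 10,
                      total.num.of.models = 100, talk = TRUE)
    str(result)                    # inspect the returned feature scores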

Highlights

  • To build a robust classifier, the number of training instances is usually required to be more than the number of features

  • The features selected by the Lasso depend on the regularization parameter, and the set of solutions for all values of this free parameter is provided by the regularization path [2]

  • Although efficient algorithms exist for recovering the whole regularization path for the Lasso [3], finding a subset of highly relevant features that leads to a robust predictor remains a prominent research question (see the sketch after this list)
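To make the last two highlights concrete, here is a small sketch that uses the lars package to recover the entire Lasso regularization path in a single run on synthetic high-dimensional data; the order in which features enter the active set along the path yields one relevance-ordering of the features.

    # Sketch: computing the full Lasso regularization path with the lars
    # package on synthetic data (far fewer samples than features).
    library(lars)

    set.seed(1)
    n <- 40; p <- 200
    X <- matrix(rnorm(n * p), n, p)
    y <- as.vector(X[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

    path <- lars(X, y, type = "lasso")  # whole path in one LARS run [3]
    # Steps of the path: positive indices = features entering the model,
    # negative indices = features being dropped.
    head(unlist(path$actions), 10)
    plot(path)                          # coefficient profiles vs. shrinkage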


Introduction

To build a robust classifier, the number of training instances is usually required to exceed the number of features. For each bootstrap sample, Bolasso considers only the single model that minimizes the training objective L in eqn (1), whereas we include the information provided by the whole regularization path. Instead of making a binary decision of inclusion or exclusion, we compute a score for each feature that helps the user select the more relevant ones. Moreover, while Bolasso-S relies on a pre-defined threshold, our theoretical study of the behaviour of irrelevant features leads to an analytical criterion for feature selection without any free parameter.

In our experiments, the best AUC is reported by testing on a set of validation samples disjoint from the training set. For both the lymphoma and colon datasets, the performance of the optimal classifier decreases if all features are provided to lars. This observation confirms that FeaLect is advantageous over lars in high-dimensional settings, and that their performance converges as an "adequate" number of samples is provided (Figures 4 and 5).
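The sketch below illustrates the general bootstrap-and-score idea on synthetic data. It is a deliberately simplified stand-in for FeaLect's combinatorial scoring scheme, not the exact method: each feature is credited according to how early it enters the lars path on each bootstrap sample, and the credits are averaged over samples.

    # Illustrative simplification of the bootstrap scoring idea (not
    # FeaLect's exact combinatorial score): on each bootstrap sample, fit
    # the whole Lasso path and credit features by how early they enter it.
    library(lars)

    score_features <- function(X, y, n_boot = 100, max_steps = 10) {
      scores <- numeric(ncol(X))
      for (b in seq_len(n_boot)) {
        idx  <- sample(nrow(X), replace = TRUE)       # bootstrap sample
        path <- lars(X[idx, ], y[idx], type = "lasso",
                     max.steps = max_steps)
        entered <- unlist(path$actions)
        entered <- unique(entered[entered > 0])       # features that entered
        # Earlier entry on the path earns a larger credit (1 down to 1/k).
        scores[entered] <- scores[entered] +
          rev(seq_along(entered)) / length(entered)
      }
      scores / n_boot                                 # average over samples
    }

    set.seed(1)
    X <- matrix(rnorm(40 * 200), 40, 200)
    y <- as.vector(X[, 1:3] %*% c(2, -1.5, 1) + rnorm(40))
    head(order(score_features(X, y), decreasing = TRUE))  # top-ranked features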

