Abstract
Background: Biomarker discovery has recently become one of the most significant research issues in the biomedical field. Owing to high-throughput technologies, genomic data, such as microarray and RNA-seq data, have become widely available. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from these data. However, such data tend to be noisy and high-dimensional and consist of a small number of samples; thus, conventional feature selection approaches might be problematic in terms of reproducibility.
Results: In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently remove irrelevant features while taking the stability of features into account. We define a stability score for each feature by aggregating the ensemble results, and apply backward feature elimination to the purified feature set based on this score; it is therefore possible to acquire an optimal set of features for performance without the need to set a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma with RNA-seq data.
Conclusion: A comparison with established algorithms, i.e., a fast correlation-based filter, random forest, and an ensemble version of L2-norm support vector machine-based recursive feature elimination, demonstrated that our method generally performs better in terms of both classification and stability. It is also shown that the proposed approach performs moderately well on high-dimensional datasets consisting of a very large number of features and a smaller number of samples. The proposed approach is expected to be applicable to many other studies aimed at biomarker discovery.
Highlights
Biomarker discovery has become one of the most significant research issues in the biomedical field
We propose a stable feature selection method based on the L1-norm support vector machine (SVM)
Datasets: The RNA-seq gene expression data of renal clear cell carcinoma were obtained from Broad GDAC Firehose, one of the genome data analysis centers of The Cancer Genome Atlas (TCGA) project [13]
Summary
Biomarker discovery has become one of the most significant research issues in the biomedical field. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from genomic data. Such data tend to be noisy and high-dimensional and consist of a small number of samples. Because biomedical data often contain far more features than samples, conventional feature selection approaches might be problematic, especially with regard to reproducibility: small changes in the dataset could lead to large changes in the feature selection result, and features selected so inconsistently should not be considered important biomarkers. Feature selection in the biomedical field should therefore consider the stability of features as well as their influence on classification performance. The L1-norm penalty of lasso regression tends to force the solution to be sparse, and it shows high efficiency for feature selection in regression problems.
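The ensemble procedure outlined in the abstract (ensemble L1-norm SVM, a per-feature stability score, then backward elimination) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn and NumPy, uses bootstrap resampling to build the ensemble, takes the stability score to be the fraction of rounds in which a feature receives a nonzero weight, and uses cross-validated accuracy as the elimination criterion. All function names, parameter values, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def make_toy_data(n_samples=60, n_features=200, n_informative=5, seed=0):
    """Synthetic stand-in for expression data: few samples, many features."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_features))
    X[y == 1, :n_informative] += 2.0  # only the first columns carry signal
    return X, y


def stability_scores(X, y, n_rounds=30, C=0.1, seed=0):
    """Selection frequency of each feature across bootstrap L1-SVM fits.

    Assumption: the source does not define the score in this excerpt;
    selection frequency is one common way to aggregate ensemble results.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    used = 0
    for _ in range(n_rounds):
        idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
        if np.unique(y[idx]).size < 2:
            continue  # skip degenerate resamples with a single class
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                        C=C, max_iter=5000)
        clf.fit(X[idx], y[idx])
        counts += np.abs(clf.coef_.ravel()) > 1e-8  # nonzero weight = selected
        used += 1
    return counts / max(used, 1)


def backward_elimination(X, y, scores, cv=5):
    """Drop the least stable feature one at a time; keep the best subset."""
    # "Purified" set (assumption): features selected at least once.
    current = [j for j in np.argsort(scores) if scores[j] > 0]
    best_acc, best_set = -1.0, list(current)
    while current:
        acc = cross_val_score(LinearSVC(dual=False, max_iter=5000),
                              X[:, current], y, cv=cv).mean()
        if acc >= best_acc:  # ties favor the smaller subset
            best_acc, best_set = acc, list(current)
        current = current[1:]  # remove the feature with the lowest score
    return best_set, best_acc


if __name__ == "__main__":
    X, y = make_toy_data()
    scores = stability_scores(X, y)
    subset, acc = backward_elimination(X, y, scores)
    print(len(subset), round(acc, 3))
```

Because the subset is chosen by the best cross-validated accuracy along the elimination path, no fixed cutoff on the stability score is needed, which mirrors the threshold-free selection described in the abstract. In a real study the accuracy estimate should come from data held out from the selection step to avoid optimistic bias.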