Feature selection method based on support vector machine and shape analysis for high-throughput medical data

Qiong Liu, Qiong Gu, Zhao Wu

doi:10.1016/j.compbiomed.2017.10.008

Abstract

Mass-spectrometry-based proteomics data analysis is a powerful tool for the early diagnosis of tumors and other diseases. It can be used to find, in high-throughput mass spectrometry data, the features that reflect differences between samples, which is important for identifying tumor markers. Proteomics mass spectrometry data are characterized by few samples, a very large number of features, and noise interference, all of which pose a great challenge to traditional machine learning methods. Traditional unsupervised dimensionality reduction methods do not use the label information effectively, so the subspaces they find may not be the ones in which the classes are most separable. To overcome these shortcomings, we present a novel feature selection method based on support vector machine (SVM) and shape analysis. During feature selection, our method considers not only the interaction between features but also the relationship between features and class labels, which improves classification performance. Experimental results on four groups of proteomics data show that, compared with a traditional unsupervised feature extraction method (i.e., Principal Component Analysis - Procrustes Analysis, PCA-PA), our method selects fewer features while still achieving a high recognition rate. In addition, our method achieves a higher recognition rate than two multivariate filter methods, i.e., Max-Relevance Min-Redundancy (MRMR) and Fast Correlation-Based Filter (FCBF).
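To make concrete the idea of weighing feature-label relevance against feature-feature interaction, the following is a minimal sketch of a greedy relevance-minus-redundancy filter in the spirit of MRMR. It is not the SVM and shape analysis method proposed in this paper. It assumes absolute Pearson correlation as a simple stand-in for the mutual information normally used in MRMR, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def greedy_relevance_redundancy(columns, labels, k):
    """Greedily pick k feature indices.

    Assumption: relevance is |corr(feature, label)| and redundancy is the mean
    |corr(feature, already selected feature)|; MRMR proper uses mutual information.
    """
    relevance = [abs(pearson(col, labels)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            abs(pearson(columns[j], columns[s])) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 6 samples, 4 features stored column-wise.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # feature 0: strongly tied to the label
    [2.0, 2.4, 1.8, 6.0, 6.2, 5.8],  # feature 1: rescaled copy of feature 0
    [0.5, 0.0, 1.0, 1.5, 1.0, 2.0],  # feature 2: relevant, less redundant
    [0.3, 0.1, 0.2, 0.1, 0.3, 0.2],  # feature 3: unrelated to the label
]
selected = greedy_relevance_redundancy(columns, labels, k=2)
print(selected)
```

On this toy data, features 0 and 1 are equally relevant to the label, but once one of them is chosen the other adds almost nothing new, so the redundancy penalty steers the second pick toward feature 2. A purely univariate ranking by relevance would have kept both near-duplicates.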

Full Text