Abstract
In this paper, we compare the performance of six feature selection methods for LC-MS-based proteomics and metabolomics biomarker discovery: the t test, the Mann-Whitney-Wilcoxon test (mww test), nearest shrunken centroid (NSC), linear support vector machine-recursive feature elimination (SVM-RFE), principal component discriminant analysis (PCDA), and partial least squares discriminant analysis (PLSDA). The methods were evaluated on human urine and porcine cerebrospinal fluid samples that were spiked with a range of peptides at different concentration levels. The ideal feature selection method should select the complete list of discriminating features that are related to the spiked peptides without selecting unrelated features. Whereas many studies have to rely on classification error to judge the reliability of the selected biomarker candidates, we assessed the accuracy of selection directly from the list of spiked peptides. The feature selection methods were applied to data sets with different sample sizes and extents of sample class separation, the latter determined by the concentration level of the spiked compounds. For each feature selection method and data set, the performance for selecting a set of features related to spiked compounds was assessed using the harmonic mean of the recall and the precision (f-score) and the geometric mean of the recall and the true negative rate (g-score). We conclude that the univariate t test and the mww test with multiple testing corrections are not applicable to data sets with small sample sizes (n = 6), but their performance improves markedly with increasing sample size, up to a point (n > 12) at which they outperform the other methods. PCDA and PLSDA select small feature sets with high precision but miss many true positive features related to the spiked peptides. NSC strikes a reasonable compromise between recall and precision for all data sets, independent of spiking level and number of samples.
Linear SVM-RFE performs poorly for selecting features related to the spiked compounds, even though the classification error is relatively low.
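The two scores named in the abstract can be illustrated with a minimal sketch. It assumes that features are represented by identifiers and that the set of features related to the spiked peptides is known. The function name `selection_scores`, the feature counts, and the example sets are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from math import sqrt


def selection_scores(selected, true_features, all_features):
    """Score a selected feature set against the known spiked-in features.

    All three arguments are collections of feature identifiers.
    Hypothetical helper for illustration, not the authors' code.
    """
    selected = set(selected)
    true_features = set(true_features)
    all_features = set(all_features)

    tp = len(selected & true_features)                 # spiked features that were selected
    fp = len(selected - true_features)                 # unrelated features that were selected
    fn = len(true_features - selected)                 # spiked features that were missed
    tn = len(all_features - selected - true_features)  # unrelated features left out

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    true_negative_rate = tn / (tn + fp) if tn + fp else 0.0

    # f-score: harmonic mean of recall and precision
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    # g-score: geometric mean of recall and true negative rate
    g_score = sqrt(recall * true_negative_rate)

    return {
        "recall": recall,
        "precision": precision,
        "true_negative_rate": true_negative_rate,
        "f_score": f_score,
        "g_score": g_score,
    }


# Made-up example: 100 features, 10 related to spiked peptides,
# 10 selected of which 8 are truly related.
all_features = range(100)
true_features = range(10)
selected = list(range(8)) + [50, 51]
scores = selection_scores(selected, true_features, all_features)
print(scores)
```

In this made-up example, recall and precision are both 0.8, so the f-score is also 0.8. The g-score comes out higher because the true negative rate is close to 1 whenever unrelated features greatly outnumber the selected ones.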
Highlights
We compare the performance of six feature selection methods for liquid chromatography–mass spectrometry (LC-MS)-based proteomics and metabolomics biomarker discovery: the t test, the Mann–Whitney–Wilcoxon test, nearest shrunken centroid (NSC), linear support vector machine-recursive feature elimination (SVM-RFE), principal component discriminant analysis (PCDA), and partial least squares discriminant analysis (PLSDA). The methods were evaluated on human urine and porcine cerebrospinal fluid samples that were spiked with a range of peptides at different concentration levels
The ideal feature selection method should select the complete list of discriminating features that are related to the spiked peptides without selecting unrelated features
Comparison between methods: all methods benefit from a larger sample size, but only some of them are affected by the between- and within-class variability of the spiked peptides
Summary
The feature selection methods were applied to data sets with different sample sizes and extents of sample class separation, the latter determined by the concentration level of the spiked compounds. The success of biomarker discovery depends on several factors: consistent and reproducible phenotyping of the individuals from whom biological samples are obtained; the quality of the analytical methodology, which in turn determines the quality of the collected data; the accuracy of the computational methods used to extract quantitative and molecular identity information to define the biomarker candidates from raw analytical data; and the performance of the applied statistical methods in the selection of a limited list of compounds with the potential to discriminate between predefined classes of samples. The goal of subsequent data preprocessing and statistical analysis is to select a limited number of candidates, which are then subjected to targeted analyses in a large number of samples for validation.