Abstract

Background

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm based on the relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, possibly because of the relatively simple voting scheme used by the classifier. We believe performance can be enhanced by separating out its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm (TSP) can serve as a dimensionally reduced subspace for other machine learning classifiers.

Results

We developed an approach integrating the k-TSP ranking algorithm with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) on a range of simulated datasets with known data structures. Compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher) or recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes grow more correlated, both in classification performance and in the ability to recover the true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves performance comparable or superior to SVM alone. Consistent with the simulations, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.

Conclusions

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.

Highlights

  • The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier

  • We describe a hybrid approach that integrates the k-TSP ranking algorithm (TSP) into other machine learning methods such as the support vector machine (SVM) and k-nearest neighbors (KNN)

  • To investigate how feature selection methods respond to different data structures, we generated two types of simulated datasets, varying the strength of differentially expressed genes, the sparseness of signal genes, and the covariance structure


Introduction

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm based on the relative expression ordering of gene pairs. Many studies have shown that it is possible to extract compelling information from microarray data to support clinical decisions on cancer diagnosis and prognosis. A common pitfall, however, is over-fitting, which afflicts many machine learning methods. It occurs especially when the few training samples are not good representatives of the classes, so that the classifier learns inherent noise from irrelevant features in the training data, leading to poor generalizability. Filter approaches [8,9] calculate feature relevance scores and select a subset of high-scoring features as input to the classifiers after removing low-scoring ones; they are computationally efficient and widely used. Whereas filter techniques are frequently univariate in nature, assuming the features are independent and ignoring feature dependencies, wrapper and embedded techniques select features in a multivariate fashion by taking feature correlations into account, an approach that is certainly biologically relevant if we consider, for example, how genes are co-regulated in pathways.
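To make the pair-ranking idea concrete, the following is a minimal sketch (not the authors' implementation) of a TSP-style filter: each gene pair (i, j) is scored by the between-class difference in the frequency with which gene i is expressed below gene j, and the genes appearing in the top-scoring pairs form a reduced feature set that can be handed to a downstream classifier such as SVM. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def tsp_scores(X, y):
    """Score every gene pair (i, j) by the TSP criterion:
    |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    X: samples x genes expression matrix; y: binary labels (0/1).
    Returns pairs and scores, ranked highest score first."""
    X0, X1 = X[y == 0], X[y == 1]
    n_genes = X.shape[1]
    pairs, scores = [], []
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            p0 = np.mean(X0[:, i] < X0[:, j])  # ordering frequency in class 0
            p1 = np.mean(X1[:, i] < X1[:, j])  # ordering frequency in class 1
            pairs.append((i, j))
            scores.append(abs(p0 - p1))
    order = np.argsort(scores)[::-1]
    return [pairs[k] for k in order], np.asarray(scores)[order]

def top_pair_genes(X, y, k=5):
    """Collect the genes from the k top-scoring pairs as a
    dimensionally reduced feature set for another classifier."""
    pairs, _ = tsp_scores(X, y)
    genes = []
    for i, j in pairs[:k]:
        genes.extend([i, j])
    return sorted(set(genes))
```

Because the score depends only on the relative ordering within each sample, it is invariant to monotone, sample-specific normalization of the expression values, which is part of what makes rank-based filters robust on microarray data.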

