Abstract
Background: The primary objectives of this paper are: 1.) to apply Statistical Learning Theory (SLT), specifically Partial Least Squares (PLS) and Kernelized PLS (K-PLS), to the universal "feature-rich/case-poor" (also known as "large p, small n" or "high-dimension, low-sample-size") microarray problem by eliminating those features (or probes) that do not contribute to the "best" chromosome bio-markers for lung cancer, and 2.) to quantitatively measure and independently verify the efficacy of this PLS process. A secondary objective is to integrate these significant improvements in diagnostic and prognostic biomedical applications into the clinical research arena; that is, to devise a framework for converting SLT results into direct, useful clinical information for patient care or pharmaceutical research. We therefore propose, and preliminarily evaluate, a process whereby PLS, K-PLS, and Support Vector Machines (SVM) may be integrated with the accepted and well-understood traditional biostatistical "gold standard" methods: the Cox Proportional Hazards model and Kaplan-Meier survival analysis. Specifically, this new combination is illustrated first with PLS and Kaplan-Meier and then with PLS and Cox Hazard Ratios (CHR), and it can easily be extended to the K-PLS and SVM paradigms. Finally, these processes are contained in the Fine Feature Selection (FFS) component of our overall feature reduction/evaluation process, which consists of the following components: 1.) coarse feature reduction, 2.) fine feature selection and 3.) classification (as described in this paper) and prediction.
Results: Our results for PLS and K-PLS showed that these techniques, as part of our overall feature reduction process, performed well on noisy microarray data. The best performance was a good Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.794 for classification of recurrence prior to or after 36 months and a strong AUC of 0.869 for classification of recurrence prior to or after 60 months. Kaplan-Meier curves for the classification groups were clearly separated, with p-values below 4.5e-12 for both 36 and 60 months. CHRs were also good, with ratios of 2.846341 (36 months) and 3.996732 (60 months).
Conclusions: SLT techniques such as PLS and K-PLS can effectively address difficult problems in analyzing biomedical data such as microarrays. The combinations with established biostatistical techniques demonstrated in this paper allow these methods to move from academic research into clinical practice.
Highlights
One of the most popular and challenging topics in bioinformatics research is gene selection from microarray data because it involves both statistical processing and biological interpretation
Very large Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) values can be achieved on both training and validation data sets, but the results provided by these trained Complex Adaptive Systems (CAS) frequently fail to generalize to data sets other than training and validation sets
Our microarray analysis and information extraction method comprised three basic components drawing from Statistical Learning Theory: 1.) Coarse Feature Reduction, 2.) Fine Feature Selection and 3.) Classification
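The three components named above can be sketched end to end in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the variance filter used for coarse reduction, the ranking by absolute PLS weight used for fine selection, the single-component PLS1 model, the synthetic data, and all function names are hypothetical choices made for this example. Scoring on a held-out split reflects the generalization concern raised in the highlights.

```python
import math
import random


def column(X, j):
    return [row[j] for row in X]


def coarse_reduce(X, keep):
    """Stage 1 (assumed criterion): keep the `keep` highest-variance probes."""
    def variance(v):
        m = sum(v) / len(v)
        return sum((a - m) ** 2 for a in v) / (len(v) - 1)
    ranked = sorted(range(len(X[0])), key=lambda j: variance(column(X, j)), reverse=True)
    return ranked[:keep]


def pls1_first_component(X, y, cols):
    """Unit-norm weights of the first PLS1 component: w proportional to Xc^T yc."""
    n = len(X)
    y_mean = sum(y) / n
    x_means = {j: sum(column(X, j)) / n for j in cols}
    w = {j: sum((X[i][j] - x_means[j]) * (y[i] - y_mean) for i in range(n)) for j in cols}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {j: v / norm for j, v in w.items()}, x_means


def fine_select(X, y, cols, keep):
    """Stage 2 (assumed criterion): keep the probes with the largest |PLS weight|."""
    w, _ = pls1_first_component(X, y, cols)
    return sorted(cols, key=lambda j: abs(w[j]), reverse=True)[:keep]


def pls_scores(X, w, x_means):
    """Stage 3: latent score t = (x - mean) . w, used as the classifier output."""
    return [sum(w[j] * (row[j] - x_means[j]) for j in w) for row in X]


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def run_demo(seed=0, n=60, p=200, informative=5):
    """Synthetic 'large p, small n' data: only the first `informative` probes carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0.0, 1.0) + (3.0 * y[i] if j < informative else 0.0)
          for j in range(p)] for i in range(n)]
    X_train, y_train = X[:40], y[:40]
    X_test, y_test = X[40:], y[40:]
    coarse = coarse_reduce(X_train, keep=30)                 # stage 1
    fine = fine_select(X_train, y_train, coarse, keep=5)     # stage 2
    w, means = pls1_first_component(X_train, y_train, fine)  # stage 3: fit on training data
    return fine, auc(pls_scores(X_test, w, means), y_test)   # score on held-out cases only


if __name__ == "__main__":
    selected, held_out_auc = run_demo()
    print("selected probes:", sorted(selected), "held-out AUC:", held_out_auc)
```

All selection steps use the training split only, so the reported AUC is not inflated by probes chosen with knowledge of the held-out cases.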
Summary
One of the most popular and challenging topics in bioinformatics research is gene selection from microarray data because it involves both statistical processing and biological interpretation. Song demonstrates that the BAHSIC algorithm encompasses the following well-known feature selection algorithms: (1) Pearson’s correlation coefficient [3,4], (2) t-test [5], (3) signal-to-noise ratio [6], (4) Centroid [7,8], (5) Shrunken Centroid [9,10], and (6) ridge regression [11]. These collective results suggest that the Evolutionary Programming driven Support Vector Machine (EP-SVM) [12,13] with a choice of similarity, sum and product kernels might be a good wrapper/classification candidate for gene selection. Those patients who are at a high risk of recurrence might benefit from post-operative adjuvant chemotherapy, whereas those who are at a low risk can be spared the side effects of chemotherapy [18].
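For orientation, the first three filter criteria listed above can be computed for a single probe as follows. This is a hedged sketch, not BAHSIC itself and not code from the paper: the function names are hypothetical, the t-test is shown in its unequal-variance (Welch) form, and the signal-to-noise ratio uses the commonly cited definition of mean difference divided by the sum of the class standard deviations, which is assumed here rather than taken from the source.

```python
import math
from statistics import mean, stdev


def split_by_class(values, labels):
    """Separate one probe's expression values into class 1 and class 0."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    return pos, neg


def pearson(values, labels):
    """Pearson correlation between expression values and 0/1 class labels."""
    mx, my = mean(values), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(values, labels))
    sx = math.sqrt(sum((x - mx) ** 2 for x in values))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return cov / (sx * sy)


def welch_t(values, labels):
    """Two-sample t statistic without assuming equal class variances."""
    pos, neg = split_by_class(values, labels)
    se = math.sqrt(stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg))
    return (mean(pos) - mean(neg)) / se


def signal_to_noise(values, labels):
    """Assumed definition: (mean_1 - mean_0) / (sd_1 + sd_0)."""
    pos, neg = split_by_class(values, labels)
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


if __name__ == "__main__":
    probe = [1.0, 3.0, 5.0, 7.0]   # toy expression values for one probe
    labels = [0, 0, 1, 1]          # toy class labels
    for score in (pearson, welch_t, signal_to_noise):
        print(score.__name__, score(probe, labels))
```

Ranking probes by the absolute value of any one of these scores gives a simple filter; each score needs at least two samples per class so that the standard deviations are defined.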