On the statistical assessment of classifiers using DNA microarray data.

N Ancona, M Carella, R Maglietta, A Piepoli, M Savino, A D'Addabbo, F Perri, G Pesole, S Liuni, R Cotugno

doi:10.1186/1471-2105-7-387

Abstract

Background: In this paper we present a method for the statistical assessment of cancer predictors that make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected at Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We set out to answer several questions relevant to the automatic diagnosis of cancer: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways does accuracy depend on the adopted classification scheme? How many genes are correlated with the pathology, and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions whilst avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data.

Results: We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of genes used. The statistical significance of the error rate is measured by using a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, providing an error rate of e = 21% (p = 0.045). This remains constant even when the number of examples increases. Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with error rates of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. For these two classifiers, the error rate decreases as the training set size increases, reaching its best performance with 35 training examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p = 0.019) respectively. Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with the pathology according to the signal-to-noise statistic. The performance of the RLS and SVM classifiers does not change when 74% of the genes are used, and degrades progressively to e = 16% (p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis and the major roles they play in colorectal tumorigenesis are discussed.

Conclusions: The method proposed provides statistically significant answers to precise questions relevant for the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required.
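The assessment procedure described in the abstract (a cross-validation error estimated over random splits, followed by a permutation test on the labels) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the nearest-centroid classifier is a simple stand-in and not the WVA, RLS, or SVM classifiers of the paper, the function names are hypothetical, and the p-value uses the common (hits + 1) / (n_perm + 1) convention, which the source does not specify.

```python
import random

def fit_centroids(X, y):
    """Stand-in classifier (NOT the paper's WVA, RLS, or SVM): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label of the closest class centroid (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)

def leave_k_out_error(X, y, n_train, n_splits, rng, permute_train_labels=False):
    """Mean test error over random splits with n_train training examples.

    With permute_train_labels=True the training labels are shuffled before
    fitting, which gives one draw from the null distribution of the error.
    """
    idx = list(range(len(X)))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        train_labels = [y[i] for i in train]
        if permute_train_labels:
            rng.shuffle(train_labels)
        model = fit_centroids([X[i] for i in train], train_labels)
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)

def permutation_p_value(X, y, n_train, n_splits, n_perm, seed=0):
    """Observed error and the fraction of permuted-label errors that are as low or lower."""
    rng = random.Random(seed)
    observed = leave_k_out_error(X, y, n_train, n_splits, rng)
    null = [
        leave_k_out_error(X, y, n_train, n_splits, rng, permute_train_labels=True)
        for _ in range(n_perm)
    ]
    hits = sum(e <= observed for e in null)
    return observed, (hits + 1) / (n_perm + 1)
```

Calling `permutation_p_value` repeatedly with increasing `n_train` would trace how the error and its significance change with training set size, which is the kind of curve the abstract summarizes.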

Highlights

In this paper we present a method for the statistical assessment of cancer predictors which make use of gene expression profiles
With as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis
As for the definition of the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required

Summary

Results

Data set description. Study population: Twenty-five patients (14 males; mean age: 60 ± 14 years), who underwent colonic resection for colorectal cancer (CRC) at CSS hospital, were prospectively recruited into this study.

The second step of the procedure consists in evaluating, for every g, the statistical significance of the error rate e_g. For this purpose, for every random split of S, s2 random permutations of the labels of the examples in the reduced training set D_n are performed.

Every point (x, y) of the curve denoted 1% (5%) in figure 3 represents the number y of genes g having TS2N(g) ≥ x with p-value p ≤ 1% (5%). In this analysis we carried out 1000 random permutations of the class labels.

Figure caption: Number of genes more highly expressed in a) normal and b) tumor tissues determined in the actual data set (observed curve) and in data sets with randomly permuted class labels (1% and 5% curves) for different values of the TS2N statistic.

Loss of expression of KLF4 is associated with cancer progression [44].
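The gene-counting analysis built on the TS2N statistic can be sketched as follows. This excerpt names the signal-to-noise statistic but does not define it, so the sketch assumes the widely used form (difference of class means divided by the sum of class standard deviations); the function names, the data layout (one list of expression values per gene), and the quantile rule are likewise hypothetical choices for illustration.

```python
import random
import statistics

def signal_to_noise(values, labels):
    """Assumed S2N form for one gene: (mean_1 - mean_0) / (sd_1 + sd_0).

    Each class needs at least two samples and the two standard deviations
    must not both be zero.
    """
    a = [v for v, t in zip(values, labels) if t == 1]
    b = [v for v, t in zip(values, labels) if t == 0]
    spread = statistics.stdev(a) + statistics.stdev(b)
    return (statistics.mean(a) - statistics.mean(b)) / spread

def count_genes_above(expr, labels, x):
    """Number of genes whose statistic is >= x; expr holds one value list per gene."""
    return sum(signal_to_noise(gene, labels) >= x for gene in expr)

def permuted_count_quantile(expr, labels, x, n_perm, level, rng):
    """Gene count at threshold x that only about a `level` fraction of label permutations exceed."""
    shuffled = list(labels)
    counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        counts.append(count_genes_above(expr, shuffled, x))
    counts.sort()
    k = min(n_perm - 1, int((1 - level) * n_perm))
    return counts[k]
```

Comparing `count_genes_above` on the real labels with `permuted_count_quantile` at `level=0.01` or `level=0.05`, over a range of thresholds x, gives an observed curve and two reference curves in the spirit of the figure described above.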


Journal: BMC Bioinformatics	Publication Date: Aug 19, 2006
Citations: 108	License type: CC BY 2.0

Similar Papers

Estimating the statistical significance of classifiers by varying the number of genes
R Maglietta ... M Savino
01 May 2006

Regularized Least Squares Cancer Classifiers from DNA microarray data
Nicola Ancona ... Graziano Pesole
BMC Bioinformatics | VOL. 6
01 Dec 2005

Wavelet based Extraction of Features from EEG Signals and Classification of Novel Emotion Recognition Using SVM and HMM Classifier and to Measure its Accuracy
M Mohanambal ... Dr.P Vishnu Vardhan
Alinteri Journal of Agriculture Sciences | VOL. 36
29 Jun 2021

On the proliferation of support vectors in high dimensions (an updated version of: Hsu D, Muthukumar V and Xu J L 2021 On the proliferation of support vectors in high dimensions Proc. 24th Int. Conf. Artificial Intelligence and Statistics vol 130 ed Banerjee A and Fukumizu K pp 91–9)
Daniel Hsu ... Vidya Muthukumar
Journal of Statistical Mechanics: Theory and Experiment | VOL. 2022
01 Nov 2022