Cancerclass: An R Package for Development and Validation of Diagnostic Tests from High-Dimensional Molecular Data

Jan Budczies, Christian Von Törne, Silvia Drab-Esfahani, Daniel Kosztyla, Carsten Denkert, Albrecht Stenzinger, Manfred Dietel

doi:10.18637/jss.v059.i01

Open Access

Journal: Journal of Statistical Software
Publication Date: Jan 1, 2014
Citations: 14
License: CC BY

Abstract

Progress in molecular high-throughput techniques has made comprehensive monitoring of biomolecules in medical samples possible. In the era of personalized medicine, these data form the basis for the development of diagnostic, prognostic and predictive tests for cancer. Because a high number of features is measured simultaneously in a relatively low number of samples, supervised learning approaches are prone to overfitting and performance overestimation. Bioinformatic methods have been developed to cope with these problems, including control of accuracy and precision. However, there is demand for easy-to-use software that integrates methods for classifier construction, performance assessment and development of diagnostic tests. To help fill this gap, we developed a comprehensive R package for the development and validation of diagnostic tests from high-dimensional molecular data. An important focus of the package is careful validation of the classification results. To this end, we implemented an extended version of the multiple random validation protocol, a previously introduced validation method. The package includes methods for continuous prediction scores. This is important in a clinical setting, because scores can be converted to probabilities and help to distinguish between clear-cut and borderline classification results. The functionality of the package is illustrated by the analysis of two cancer microarray data sets.

Highlights

Progress in molecular high-throughput techniques has led to the opportunity of simultaneous monitoring of hundreds or thousands of biomolecules in medical samples, e.g. using microarrays.
Using a protocol similar to [1], we investigate the dependence of classification accuracy on the number of features (Fig. 1).
The confidence interval of the overall classification rate is estimated from 200 random splits into training and test sets.
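
The idea of estimating a confidence interval from repeated random splits can be sketched in a few lines. The Python below is only an illustration and is not the cancerclass R API: the nearest-centroid classifier, the stratified 50/50 split, the percentile interval and the synthetic data (`make_synthetic_data`) are all assumptions chosen to keep the example self-contained. The source states only that 200 random splits into training and test sets are used.

```python
import random
import statistics


def make_synthetic_data(n_per_class, n_features, n_informative, shift, seed):
    """Hypothetical two-class data: class 'B' is shifted on the first features."""
    rng = random.Random(seed)
    X, y = [], []
    for label in ("A", "B"):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == "B":
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Per-class mean vector over all features."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=squared_distance)


def stratified_split(y, train_fraction, rng):
    """Random split that keeps every class in the training set (an assumption)."""
    train, test = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * train_fraction))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def random_split_accuracies(X, y, n_splits=200, train_fraction=0.5, seed=0):
    """Test-set accuracy for each of n_splits random training/test splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_splits):
        train, test = stratified_split(y, train_fraction, rng)
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(centroids, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies


def percentile_interval_95(values):
    """2.5% and 97.5% quantiles of the per-split accuracies."""
    cuts = statistics.quantiles(values, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]


if __name__ == "__main__":
    X, y = make_synthetic_data(20, 20, 5, shift=2.0, seed=1)
    accs = random_split_accuracies(X, y, n_splits=200)
    low, high = percentile_interval_95(accs)
    print(f"mean accuracy {statistics.fmean(accs):.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Because the classifier is refitted on each training part and scored only on the held-out part, the spread of the 200 accuracies reflects how much the estimate depends on the particular split.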

Summary

Introduction

Progress in molecular high-throughput techniques has led to the opportunity of simultaneous monitoring of hundreds or thousands of biomolecules in medical samples, e.g. using microarrays. Because of the high dimensionality of the data, and in connection with the multiple testing problem, the development of molecular tests is prone to model overfitting and performance overestimation. Bioinformatic methods have been developed to cope with these problems, e.g. the multiple random validation protocol that was presented in [1]. Cancerclass integrates methods for development and validation of diagnostic tests from high-dimensional molecular data. Simple classifiers were shown to perform well on high-dimensional data compared to more sophisticated methods [2]. The accuracy of the predictor can be evaluated using training and test set validation, leave-one-out cross-validation or a multiple random validation protocol. The functionality of cancerclass is illustrated using two sets of cancer gene expression data. Gene expression data of breast cancer with good and poor prognosis [4, 5] are obtained from the ExperimentData package cancerdata.
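
One way to see how leave-one-out cross-validation guards against performance overestimation is to repeat the feature selection inside every fold, so the held-out sample never influences which features are chosen. The Python sketch below is an illustration under stated assumptions, not the cancerclass implementation: the Welch t-statistic ranking, the nearest-centroid rule, and all function names and synthetic data are hypothetical choices made for this example.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch t-statistic comparing one feature between two groups."""
    denom = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return 0.0 if denom == 0 else (statistics.fmean(a) - statistics.fmean(b)) / denom


def select_features(X, y, k):
    """Indices of the k features with the largest |t| between the two classes."""
    first, second = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        a = [x[j] for x, lab in zip(X, y) if lab == first]
        b = [x[j] for x, lab in zip(X, y) if lab == second]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def predict_nearest_centroid(X, y, x_new, features):
    """Assign x_new to the class whose centroid is closest on the selected features."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroid = [statistics.fmean([r[j] for r in rows]) for j in features]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with feature selection redone in every fold."""
    hits = 0
    for i in range(len(y)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k)  # held-out sample is not used here
        hits += predict_nearest_centroid(X_train, y_train, X[i], features) == y[i]
    return hits / len(y)


def make_synthetic_data(n_per_class, n_features, n_informative, shift, seed):
    """Hypothetical data: class 'B' is shifted on the first n_informative features."""
    rng = random.Random(seed)
    X, y = [], []
    for label in ("A", "B"):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == "B":
                row[:n_informative] = [v + shift for v in row[:n_informative]]
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_data(15, 30, 3, shift=4.0, seed=7)
    print("leave-one-out accuracy:", loocv_accuracy(X, y, k=3))
```

Selecting features once on the full data set and cross-validating only the classifier would leak information from the held-out samples, which is one source of the overestimation discussed above.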

Multiple random validation protocol

Predictor construction and validation
