Abstract

Sequencing-based molecular characterization of tumors provides information required for individualized cancer treatment. There are well-defined molecular subtypes of breast cancer that provide improved prognostication compared to routine biomarkers. However, molecular subtyping is not yet implemented in routine breast cancer care. Clinical translation is dependent on subtype prediction models providing high sensitivity and specificity. In this study we evaluate sample size and RNA-sequencing read requirements for breast cancer subtyping to facilitate rational design of translational studies. We applied subsampling to ascertain the effect of training sample size and the number of RNA sequencing reads on classification accuracy of molecular subtype and routine biomarker prediction models (unsupervised and supervised). Subtype classification accuracy improved with increasing sample size up to N = 750 (accuracy = 0.93), although with a modest improvement beyond N = 350 (accuracy = 0.92). Prediction of routine biomarkers achieved accuracy of 0.94 (ER) and 0.92 (Her2) at N = 200. Subtype classification improved with RNA-sequencing library size up to 5 million reads. Development of molecular subtyping models for cancer diagnostics requires well-designed studies. Sample size and the number of RNA sequencing reads directly influence accuracy of molecular subtyping. Results in this study provide key information for rational design of translational studies aiming to bring sequencing-based diagnostics to the clinic.

Highlights

  • Breast cancer is the second most common cancer worldwide and a leading cause of cancer-related mortality in women

  • In this study we investigate the dependency of prediction accuracy on the training sample size using The Cancer Genome Atlas (TCGA) Data set[33]

  • The performance of Elastic net and Random forest classifiers improved with increasing training sample size, which is typically expected in classification problems with noisy data and not perfect class separation (Fig. 1)

Read more

Summary

Introduction

Breast cancer is the second most common cancer worldwide and a leading cause of cancer-related mortality in women. The classification system based on gene-expression data is known to provide improved prognosis over routine pathology[8,9,10,11,12]. Their clinical utility has been demonstrated with respect to personalized treatment in chemotherapy and hormonal therapy[13,14,15,16]. There have not been any comprehensive studies that assess sample size requirements for molecular subtyping of breast cancer using RNAseq-based gene expression profiling and considering larger datasets. Our results are directly relevant for anyone designing translational studies focused on patient stratification by molecular subtyping in breast cancer, and may be relevant for study design in other cancers

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call