Abstract

The Cancer Genome Atlas (TCGA) is one of the largest biorepositories of digital histology. Deep learning (DL) models have been trained on TCGA to predict numerous features directly from histology, including survival, gene expression patterns, and driver mutations. However, we demonstrate that these features vary substantially across tissue submitting sites in TCGA for over 3,000 patients with six cancer subtypes. Additionally, we show that histologic image differences between submitting sites can easily be identified with DL. Site detection remains possible despite commonly used color normalization and augmentation methods, and we quantify the image characteristics constituting this site-specific digital histology signature. We demonstrate that these site-specific signatures lead to biased accuracy for prediction of features including survival, genomic mutations, and tumor stage. Furthermore, ethnicity can also be inferred from site-specific signatures, which must be accounted for to ensure equitable application of DL. These site-specific signatures can lead to overoptimistic estimates of model performance, and we propose a quadratic programming method that abrogates this bias by ensuring models are not trained and validated on samples from the same site.
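
The constraint named at the end of the abstract, that no submitting site contributes samples to both training and validation, can be illustrated with a short sketch. This is a simple greedy heuristic, not the quadratic programming method the authors propose: it balances only the number of samples per fold and does not balance outcome classes. The function name `site_preserved_folds` and the site labels are hypothetical.

```python
from collections import Counter


def site_preserved_folds(sites, n_folds=3):
    """Assign each sample to a fold so that no site is split across folds.

    `sites` holds one site label per sample. Sites are placed greedily,
    largest first, into whichever fold currently has the fewest samples.
    """
    counts = Counter(sites)
    fold_sizes = [0] * n_folds
    site_to_fold = {}
    # Sort by size (descending), then by label so ties are deterministic.
    for site, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        site_to_fold[site] = target
        fold_sizes[target] += n
    return [site_to_fold[s] for s in sites]


# Hypothetical site labels, for illustration only.
sites = ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"] * 1
folds = site_preserved_folds(sites, n_folds=3)
```

Because every sample from a given site lands in the same fold, holding out one fold for validation guarantees that the model has never seen slides from the validation sites during training.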

Highlights

  • The Cancer Genome Atlas (TCGA) is one of the largest biorepositories of digital histology

  • For breast cancer (BRCA TCGA cohort), all demographic characteristics as well as estrogen receptor status (n = 969), progesterone receptor status (n = 966), HER2 expression (n = 847), PAM50 subtype (n = 914), TP53 mutational status (n = 1004), immune subtype (n = 1002), and 3-year progression-free survival (n = 458)[34] varied significantly between cohorts, with false discovery rate correction and P < 0.05 (Fig. 2). We systematically applied this approach to five other major solid tumor types, and demonstrate that multiple impactful clinical features vary by the site for all tumor subtypes tested—including ALK fusion status in squamous cell lung cancer (LUSC TCGA cohort, n = 155) and lung adenocarcinoma (LUAD TCGA cohort, n = 112) and human papillomavirus (HPV) status in head and neck squamous cell carcinoma (HNSC TCGA cohort, n = 332)—all with P < 0.05 and significant after FDR correction (Supplementary Table 1 and Supplementary Fig. 1)

  • Of relevance to the increasing interest in pathology-based survival models, stage varied by site in all cancer subsets tested, and 3-year progression-free survival (PFS) varied by site in all cancers except lung and colorectal adenocarcinoma
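
The highlights above report significance after false discovery rate (FDR) correction. As a minimal sketch of what such a correction does, the following implements the Benjamini-Hochberg procedure. The source does not state which FDR procedure was used, so that choice is an assumption, and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per clinical feature tested against site.
raw = [0.001, 0.04, 0.03, 0.2]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```

In this made-up example three raw p-values fall below 0.05, but only the smallest remains significant after adjustment, which is why the correction matters when many features are tested against site.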

Summary

Introduction

The Cancer Genome Atlas (TCGA) is one of the largest biorepositories of digital histology. Site detection remains possible despite commonly used color normalization and augmentation methods, and we quantify the image characteristics constituting this site-specific digital histology signature. We demonstrate that these site-specific signatures lead to biased accuracy for prediction of features including survival, genomic mutations, and tumor stage. Deep-learning approaches have been applied to identify less apparent features of interest, including clinical biomarkers such as breast cancer receptor status[4,9], microsatellite instability[10,11], or the presence of pathogenic virus in cancer[12]. These approaches have been further extended to infer more complex features of tumor biology directly from histology, including gene expression[13,14,15] and pathogenic mutations[16,17]. Differences in specimen acquisition, staining, digitization, and patient demographics all contribute to a unique site-specific digital histology signature, which could in turn lead to a lack of generalizability of digital imaging models.
