Abstract

Gene expression profiling (GEP) is an important approach for informing breast cancer treatment. However, access to GEP is limited by cost, tissue transportation, and turnaround time. In this work, we explore the prediction of estrogen receptor gene (ESR1) expression directly from images of hematoxylin and eosin (H&E) stained, formalin-fixed paraffin-embedded (FFPE) breast cancer tissue. Because H&E staining is a fast and inexpensive component of standard tissue preparation in pathology, this approach is tissue-preserving and requires no additional tissue processing. Our method uses a deep multiple instance learning approach that converts each cropped image patch from a whole-slide image (WSI) into a numeric embedding vector summarizing the information in that patch. A gated attention mechanism then aggregates these embeddings into a single prediction for the WSI. We train and tune the model on a site-based split of The Cancer Genome Atlas (TCGA) BRCA dataset, and evaluate it on both a held-out split of TCGA (independent sites) and a separate dataset from a tertiary teaching hospital (TTH). All splits of TCGA have ESR1 expression values, immunohistochemistry (IHC) estrogen receptor (ER) status, and limited clinical outcome data. The TTH dataset has IHC-based ER status and clinical outcome, but not ESR1 expression. On the TCGA held-out test split, our model’s root mean square error (RMSE) for predicting normalized gene expression counts (TPM) was 2.90 [95% CI: 2.57, 3.23], and the Pearson correlation was 0.57 [95% CI: 0.46, 0.67]. For predicting IHC-based ER status on the same TCGA split, this weakly supervised ESR1-predicting model had an area under the receiver operating characteristic curve (AUROC) of 0.81 [0.74, 0.87]. This was comparable to a strongly supervised method directly predicting ER status (AUROC: 0.85 [0.77, 0.92]).
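The aggregation step described above can be illustrated with a small sketch. The abstract does not give the exact formulation, so this assumes one commonly used form of gated attention pooling: each patch embedding h_k receives a score w · (tanh(V h_k) ⊙ sigmoid(U h_k)), the scores are passed through a softmax over patches, and the slide-level vector is the weighted sum of the embeddings. The names `V`, `U`, `w`, the function name, and the toy values are hypothetical and not taken from the authors' model; the final regression head that would map the pooled vector to an ESR1 value is omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gated_attention_pool(embeddings, V, U, w):
    """Pool K patch embeddings into one slide-level vector.

    embeddings: list of K patch vectors, each of length D
    V, U: hypothetical learned matrices of shape (H, D)
    w: hypothetical learned vector of length H
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in embeddings:
        tanh_branch = [math.tanh(x) for x in matvec(V, h)]
        gate_branch = [1.0 / (1.0 + math.exp(-x)) for x in matvec(U, h)]
        # The sigmoid gate modulates the tanh features element-wise.
        scores.append(
            sum(wi * t * g for wi, t, g in zip(w, tanh_branch, gate_branch))
        )

    # Softmax over patches (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted sum of the patch embeddings.
    dim = len(embeddings[0])
    pooled = [
        sum(a * h[d] for a, h in zip(weights, embeddings)) for d in range(dim)
    ]
    return pooled, weights


# Toy example: three 2-dimensional patch embeddings, hidden size 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.2, 0.4], [-0.3, 0.1]]
w = [1.0, -1.0]
slide_vector, attention = gated_attention_pool(patches, V, U, w)
```

Because the attention weights sum to one, the pooled vector is a convex combination of the patch embeddings, and the weights themselves indicate which patches contributed most to the slide-level prediction.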
Lastly, when evaluated for association with patient outcomes (progression-free interval; PFI) using the independent TTH dataset, this ESR1-predicting model had a concordance index (c-index) of 0.59 [0.52, 0.65]. For comparison, the c-index for PFI using the IHC-based ER status for these cases was 0.61 [0.54, 0.66]. This work further demonstrates the potential to infer gene expression from H&E-stained images in a manner that shows meaningful associations with clinical variables. Because obtaining H&E-stained images is substantially easier and faster than genetic testing, the capability to derive molecular genetic information from these images may increase access to this type of information for patient risk stratification and provide research insights into molecular-morphological associations. Future work incorporating more comprehensive sets of genes remains a valuable next step.

Citation Format: Anvita A. Srinivas, Ronnachai Jaroensri, Ellery Wulczyn, James H. Wren, Elaine E. Thompson, Niels Olson, Fabien Beckers, Melissa Miao, Yun Liu, Cameron Chen, David F. Steiner. Estrogen receptor gene expression prediction from H&E-stained whole slide images. [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5357.
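For readers unfamiliar with the concordance index reported in the abstract, the following is a minimal sketch of how such a statistic can be computed, assuming Harrell's pairwise definition. It is not the authors' evaluation code. The function name and toy data are hypothetical, pairs with tied times are simply skipped as a simplification, and which direction of a model output counts as "higher risk" is left to the caller.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk score orders correctly.

    times: follow-up time for each patient
    events: 1 if progression was observed, 0 if the patient was censored
    risk_scores: higher value means higher predicted risk (caller's convention)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four patients, the third is censored at time 15.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.6, 0.7, 0.1]
c_index = concordance_index(times, events, risk)  # 4 of 5 pairs concordant: 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported values of 0.59 and 0.61.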