Artificial image objects for classification of breast cancer biomarkers with transcriptome sequencing data and convolutional neural network algorithms

Xiangning Chen, Justin M Balko, Daniel G Chen, Zhongming Zhao, Jingchun Chen

doi:10.1186/s13058-021-01474-z

Abstract

Background: Transcriptome sequencing has been broadly available in clinical studies. However, it remains a challenge to utilize these data effectively for clinical applications because the data are high-dimensional and expression is highly correlated between individual genes.

Methods: We proposed a method to transform RNA sequencing data into artificial image objects (AIOs) and applied convolutional neural network (CNN) algorithms to classify these AIOs. With the AIO technique, we considered each gene as a pixel in an image and its expression level as the pixel intensity. Using the GSE96058 (n = 2976), GSE81538 (n = 405), and GSE163882 (n = 222) datasets, we created AIOs for the subjects and designed CNN models to classify the biomarker Ki67 and Nottingham histologic grade (NHG).

Results: With fivefold cross-validation, we achieved a classification accuracy of 0.821 ± 0.023 and an AUC of 0.891 ± 0.021 for Ki67 status. For NHG, the weighted average of categorical accuracy was 0.820 ± 0.012, and the weighted average of AUC was 0.931 ± 0.006. With GSE96058 as training data and GSE81538 as testing data, the accuracy and AUC were 0.826 ± 0.037 and 0.883 ± 0.016 for Ki67, and 0.764 ± 0.052 and 0.882 ± 0.012 for NHG, respectively. These results were 10% better than the results reported in the original studies. In survival analyses, the Ki67 calls generated by our models predicted survival better than the calls from trained pathologists.

Conclusions: We demonstrated that RNA sequencing data can be transformed into AIOs and used to classify Ki67 status and NHG with CNN algorithms. The AIO method can handle high-dimensional data with highly correlated variables, and it requires no variable selection. With the AIO technique, a data-driven, consistent, and automation-ready model could be developed to classify biomarkers from RNA sequencing data and provide more efficient care for cancer patients.
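The gene-to-pixel mapping described in the Methods can be illustrated with a minimal sketch. The abstract states only that each gene is a pixel and its expression level is the pixel intensity; the function name `expression_to_aio`, the row-major gene order, the min-max scaling to 0..255, and the zero padding to a square image are all assumptions made here for illustration, not details of the authors' implementation.

```python
import math


def expression_to_aio(expression, side=None):
    """Map one subject's gene expression vector to a square 8-bit image.

    Each gene becomes one pixel and its expression level becomes the pixel
    intensity. Gene order (row-major), min-max scaling, and zero padding
    are assumptions of this sketch.
    """
    n = len(expression)
    if n == 0:
        raise ValueError("expression vector is empty")
    if side is None:
        side = math.isqrt(n - 1) + 1  # smallest side with side * side >= n
    if side * side < n:
        raise ValueError("image is too small for the number of genes")

    lo, hi = min(expression), max(expression)
    span = hi - lo
    # A constant vector maps to all-zero pixels to avoid dividing by zero.
    pixels = [round(255 * (x - lo) / span) if span else 0 for x in expression]
    pixels += [0] * (side * side - n)  # pad unused pixels with zeros

    # Cut the flat pixel list into rows of length `side`.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Five hypothetical expression values become a 3 x 3 image.
image = expression_to_aio([0.0, 1.0, 5.0, 3.0, 2.0])
for row in image:
    print(row)
```

Because every gene keeps its own pixel, no variable selection is needed before classification; the image size simply grows with the number of genes.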

Highlights

Transcriptome sequencing has been broadly available in clinical studies
We found that a convolutional neural network (CNN) architecture (Fig. 2) with six 3 × 3 convolutional layers followed by one 1 × 1 convolutional layer and four fully connected layers produced good testing accuracy
We evaluated the predictive power of the calls produced from our CNN models by comparing it to that of the consensus calls from trained pathologists
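To make the layer layout in the highlights concrete, the sketch below traces feature-map shapes through six 3 × 3 convolutions, one 1 × 1 convolution, and four fully connected layers. Only the layer types and counts come from the text. The input size, filter counts, "same" padding, 2 × 2 pooling after every second convolution, and dense-layer widths are hypothetical choices for illustration and should not be read as the architecture in Fig. 2.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_cnn(input_side,
              conv_filters=(32, 32, 64, 64, 128, 128),  # assumed filter counts
              pointwise_filters=16,                      # assumed 1x1 width
              dense_units=(256, 128, 64, 2)):            # assumed dense widths
    """Return (layer name, output shape) pairs for the assumed layout."""
    layers = []
    side = input_side
    for i, filters in enumerate(conv_filters, start=1):
        # 3x3 convolution with padding 1 keeps the spatial size (assumed).
        side = conv_out(side, kernel=3, padding=1)
        layers.append((f"conv3x3_{i}", (side, side, filters)))
        if i % 2 == 0:
            # 2x2 max pooling after every second convolution (assumed).
            side //= 2
            layers.append((f"maxpool_{i // 2}", (side, side, filters)))
    # The 1x1 convolution mixes channels without changing the spatial size.
    side = conv_out(side, kernel=1)
    layers.append(("conv1x1", (side, side, pointwise_filters)))
    layers.append(("flatten", (side * side * pointwise_filters,)))
    for j, units in enumerate(dense_units, start=1):
        layers.append((f"dense_{j}", (units,)))
    return layers


# Trace a hypothetical 128 x 128 single-channel AIO.
for name, shape in trace_cnn(128):
    print(name, shape)
```

Under these assumptions the 1 × 1 convolution reduces the channel count before flattening, which keeps the first fully connected layer small.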

Summary

Introduction

Transcriptome sequencing has been broadly available in clinical studies. However, it remains a challenge to utilize these data effectively for clinical applications because the data are high-dimensional and expression is highly correlated between individual genes. Breast cancer is a complex disease; early detection and evaluation of the tumor are critical for prognosis and long-term survival. Assessment of the proliferation antigen Ki67 is increasingly recommended [1, 2]. Such biomarkers provide valuable prognostic information for survival and treatment outcomes [3, 4]. Current approaches to evaluating these biomarkers, i.e., immunohistochemistry stains, require careful assessment by trained pathologists, and disagreements between pathologists are often observed, especially for Nottingham histologic grade (NHG) and Ki67. Other technical factors, such as sample fixation, antibody batches, and scoring methods, also contribute to inconsistent results. To obtain consistent assessments, more robust methods that are amenable to automation are highly desirable.

Objectives

Methods

Results

Discussion

Conclusion