Abstract

With an estimated 9.6 million deaths in 2018, cancer is a leading cause of human death worldwide. Because the organ and cell type in which a tumor originates determine a patient's response to therapies, quick and accurate identification of the primary site of cancer is critical to guiding the most effective treatment. However, the current cancer diagnosis procedure is a multi-step process that relies on extensive clinical examination and laboratory testing. Recent years have seen increased interest in somatic point mutation-based cancer classification, which has the potential to differentiate tumors with similar histopathological appearances and to deliver more accurate results. Yet only a few studies exist, because accuracy remains limited even with deep learning algorithms: the status quo is 65% on 12 tumor classes. The primary challenge is modeling the complex interactions among genes, which requires an advanced classifier capable of extracting high-level features. To address this issue, this study proposes a novel approach, Gene Embedding-based Cancer Classification (GEM-CC), which combines gene embeddings with somatic mutation data from patients for cancer identification. Specifically, we introduce embeddings of gene expression, so the new algorithm can harvest the information in both somatic mutations and gene expression without requiring gene expression data from the patient, which is often not available in a clinical setting. This is the first time that complex interactions among genes are decoded and effectively incorporated into neural networks through embeddings for cancer classification. Somatic point mutation data from The Cancer Genome Atlas (TCGA) are used in this study. Two gene embeddings are applied: Gene2Vec and a TCGA embedding. Gene2Vec is a vector representation of all human genes, whereas the TCGA embedding is extracted from TCGA and represents cancer-related genes.
After the raw data are preprocessed by gene filtering and sparsity reduction, a multi-layer deep learning neural network, GEM-CC, is built with the gene embeddings as the first layer. A feed-forward neural network with six fully connected hidden layers, with dimensions ranging from 5000 to 100, is used as the backbone for training. The number of hidden layers and their dimensions were varied experimentally to optimize model performance. This architecture is chosen to exploit the ability of an artificial neural network to learn and model complex non-linear relationships between genetic mutations and cancer types. The output is then fed to a softmax layer for multi-class classification. GEM-CC is validated with both standard 10-fold cross-validation and a holdout dataset. The model provides a prediction accuracy of up to 80% on 12 tumor classes with the combined gene embeddings, an improvement of more than 15% compared with previous studies. Given the lack of a rapid and effective approach to cancer diagnosis, our research is the first to demonstrate the success of somatic mutation-based classification that incorporates gene expression embeddings. Furthermore, an analysis of the best-performing GEM-CC model reveals a number of genetic markers specific to each cancer type, which could be studied for a better understanding of the molecular mechanisms of cancer and used as targets for therapeutic drug development.

Citation Format: Sidra Y. Xu. Gene embedding: A novel computational hybrid approach to somatic mutation-based primary cancer type identification and biomarker discovery [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 864.
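
As a rough illustration of the architecture described in the abstract (an embedding first layer, six fully connected hidden layers, and a softmax output over 12 tumor classes), the forward pass might be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the abstract does not specify how the two embeddings are combined, which activation is used, or the exact layer widths. Here the embeddings are random stand-ins concatenated per gene, the activation is ReLU, the widths are toy values scaled down from the 5000-to-100 range, and all names (`gene2vec`, `tcga_emb`, `forward`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GENES = 50      # toy value; the real gene count after filtering is not given
N_CLASSES = 12    # 12 tumor classes, as stated in the abstract
# Toy stand-in for six hidden layers "ranging from 5000 to 100" (assumed widths).
HIDDEN_DIMS = [64, 48, 32, 24, 16, 8]

# Random stand-ins for the pretrained embeddings (assumption: one row per gene).
gene2vec = rng.normal(size=(N_GENES, 8))
tcga_emb = rng.normal(size=(N_GENES, 4))
# Assumption: "combined" embeddings are concatenated along the feature axis.
embedding = np.concatenate([gene2vec, tcga_emb], axis=1)


def init_layers(in_dim, dims, rng):
    """Create (weight, bias) pairs for a chain of fully connected layers."""
    layers = []
    for out_dim in dims:
        w = rng.normal(scale=np.sqrt(2.0 / in_dim), size=(in_dim, out_dim))
        layers.append((w, np.zeros(out_dim)))
        in_dim = out_dim
    return layers


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, embedding, hidden, out):
    """Map binary mutation vectors to class probabilities."""
    # Embedding as first layer: sums the embedding rows of the mutated genes.
    h = x @ embedding
    for w, b in hidden:
        h = np.maximum(0.0, h @ w + b)  # ReLU (assumed activation)
    w, b = out
    return softmax(h @ w + b)


hidden = init_layers(embedding.shape[1], HIDDEN_DIMS, rng)
out = init_layers(HIDDEN_DIMS[-1], [N_CLASSES], rng)[0]

# Five synthetic patients with sparse binary mutation profiles (not TCGA data).
x = (rng.random((5, N_GENES)) < 0.1).astype(float)
probs = forward(x, embedding, hidden, out)
pred = probs.argmax(axis=1)  # predicted class index per patient
```

The weights here are untrained, so the predictions carry no meaning; the sketch only shows how a mutation vector could flow through an embedding layer and a fully connected stack into a 12-way softmax.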
