Abstract

Accurate cancer diagnosis and prognosis depend on a patient's specific cancer type and molecular traits, which need to be addressed carefully. Discovering important biomarkers is becoming a key step toward understanding the molecular mechanisms of carcinogenesis, in which genomics data and clinical outcomes need to be analyzed before any clinical decision is made. Copy number variations (CNVs) have been found to be associated with the risk of individual cancers and can therefore be used to reveal genetic predispositions before cancer develops. In this paper, we collect CNV data for about 8000 cancer patients covering 14 different cancer types from The Cancer Genome Atlas. We then prepare two sparse representations of CNVs, based on 578 oncogenes and 20,308 protein-coding genes, that include genomic deletions and duplications across the samples. Next, we train Conv-LSTM and convolutional autoencoder (CAE) networks on both representations and create snapshot models. While the Conv-LSTM can capture locally and globally important features, the CAE can use unsupervised pretraining to initialize the weights of the subsequent convolutional layers to cope with the sparsity. A model averaging ensemble (MAE) is then applied to combine the snapshot models into a single prediction. Finally, we identify the most significant CNV biomarkers using guided-gradient class activation map plus (GradCAM++) and rank the top genes for different cancer types. Results from several experiments show fairly high prediction accuracies for the majority of cancer types. In particular, using protein-coding genes, the Conv-LSTM and CAE networks predict cancer types correctly in at least 72.96% and 76.77% of the cases, respectively. In contrast, using oncogenes gives moderately higher accuracies of 74.25% and 78.32%, whereas the snapshot model based on MAE shows an overall accuracy improvement of 2.5%.
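The model averaging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each snapshot model outputs a matrix of class probabilities (samples by cancer types) and that the snapshots are weighted equally. The function name `model_average` and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def model_average(prob_list):
    """Average class-probability arrays produced by several snapshot models.

    Each array is assumed to have shape (n_samples, n_classes).
    Equal weighting is an assumption of this sketch.
    """
    stacked = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    return stacked.mean(axis=0)            # (n_samples, n_classes)


# Hypothetical softmax outputs of two snapshots for 2 samples and 3 classes.
snap_a = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
snap_b = np.array([[0.4, 0.4, 0.2],
                   [0.1, 0.2, 0.7]])

avg = model_average([snap_a, snap_b])
pred = avg.argmax(axis=1)  # predicted class index per sample
print(avg)
print(pred)
```

Averaging probabilities rather than hard votes lets a confident snapshot outweigh an uncertain one, as in the second sample above, where the two snapshots disagree.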

Highlights

  • Cancer results from highly expressed genes, caused by mutations or alterations in the gene regulation that controls cell division and cell growth

  • Using MSeq-CNV, we selected a fixed number of genes and extracted the copy numbers (CNs) that overlapped with the gene locations, removing those that fall in non-protein-coding genes because arguably more than 80% of human genes do not encode any protein, i.e., CNs from these regions have little-to-no effect on tumor growth

  • The second LSTM layer emits an output ‘H’, which is reshaped into a feature sequence and fed into fully connected layers to predict the cancer types at the timestep dimension. This helps produce a sequence vector from the last LSTM layer, which will hopefully bring out the CNVs of specific genes that are highly indicative of a specific cancer type
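The gene-level extraction described in the highlights (keeping the copy numbers that overlap gene locations) might look roughly like the sketch below. It assumes half-open genomic intervals and a length-weighted mean over overlapping segments. Both are illustrative choices rather than the paper's stated procedure, and `gene_cn_features`, the gene names, and the coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int, float]  # (chromosome, start, end, copy-number value)
Gene = Tuple[str, int, int]            # (chromosome, start, end)


def gene_cn_features(segments: List[Segment], genes: Dict[str, Gene]) -> List[float]:
    """Build one copy-number feature per gene, in sorted gene-name order.

    A gene's value is the length-weighted mean of the segments overlapping it;
    genes without any overlapping segment get 0.0, which keeps the vector sparse.
    """
    features = []
    for name in sorted(genes):
        chrom, g_start, g_end = genes[name]
        total_len = 0
        weighted = 0.0
        for s_chrom, s_start, s_end, value in segments:
            if s_chrom != chrom:
                continue
            # Length of the intersection of the two half-open intervals.
            overlap = min(g_end, s_end) - max(g_start, s_start)
            if overlap > 0:
                total_len += overlap
                weighted += overlap * value
        features.append(weighted / total_len if total_len else 0.0)
    return features


# Hypothetical gene coordinates and CNV segments for a single sample.
genes = {"GENE_A": ("chr1", 100, 200), "GENE_B": ("chr2", 50, 150)}
segments = [("chr1", 0, 150, 1.0), ("chr1", 150, 300, -0.5), ("chr2", 500, 600, 0.5)]

features = gene_cn_features(segments, genes)
print(features)
```

Restricting `genes` to protein-coding genes or to oncogenes would give the two representations the abstract refers to, with one fixed-length vector per sample.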


Summary

Introduction

Cancer results from highly expressed genes, caused by mutations or alterations in the gene regulation that controls cell division and cell growth. CNVs are hypothesized to be of functional significance. Although this significance is not fully understood, it is likely that CNVs are responsible for a considerable proportion of phenotypic variation [39]. Such variations may lead to changes in gene dosage and expression [12]. These changes in gene expression (GE) are responsible for different phenotypic variations or diseases (e.g., disabilities, diabetes, schizophrenia, cancer, and obesity) or are envisaged to be associated with other diseases, e.g., autism spectrum disorder [4, 34, 37]. The extracted CNV data were used to train machine learning (ML) models for cancer identification and type prediction. These approaches are not capable of simultaneous analysis of multiple samples and recurrent CNVs [32].

Related works
Data collection
Data preprocessing
Feature extraction based on protein-coding genes
Feature extraction based on oncogenes
Network constructions and training
Conv-LSTM network
Convolutional autoencoder classifier
Ensemble of classifiers
Networks training
Finding and validating important biomarkers
Hyperparameter tuning
Experiment results
Experiment setup
Performance analysis of individual model
Performance analysis of the ensemble model
Validation of the top biomarkers
Analysis of the common biomarkers
Comparisons with related works
Findings
Conclusion