Abstract

The confluence of deep sequencing and powerful machine learning is providing an unprecedented peek at the darkest of the dark genomic matter, the non-coding genomic regions lacking any functional annotation. While deep sequencing uncovers rare tumor variants, the heterogeneity of the disease confounds the best of machine learning (ML) algorithms. Here we set out to answer if the dark-matter of the genome encompass signals that can distinguish the fine subtypes of disease that are otherwise genomically indistinguishable. We introduce a novel stochastic regularization, ReVeaL, that empowers ML to discriminate subtle cancer subtypes even from the same ‘cell of origin’. Analogous to heritability, implicitly defined on whole genome, we use predictability (F1 score) definable on portions of the genome. In an effort to distinguish cancer subtypes using dark-matter DNA, we applied ReVeaL to a new WGS dataset from 727 patient samples with seven forms of hematological cancers and assessed the predictivity over several genomic regions including genic, non-dark, non-coding, non-genic, and dark. ReVeaL enabled improved discrimination of cancer subtypes for all segments of the genome. The non-genic, non-coding and dark-matter had the highest F1 scores, with dark-matter having the highest level of predictability. Based on ReVeaL’s predictability of different genomic regions, dark-matter contains enough signal to significantly discriminate fine subtypes of disease. Hence, the agglomeration of rare variants, even in the hitherto unannotated and ill-understood regions of the genome, may play a substantial role in the disease etiology and deserve much more attention.

Highlights

  • Since the completion of the Human Genome Project, progress has been made in understanding the genome, in diseases of the genome such as cancer

  • Equipped with ultra-deep whole genome sequencing (WGS) capabilities that dig out ever more rare variants and current machine learning (ML) capabilities with the potential to process large amounts of data undeterred by noise at various scales, we focus here on blood cancer

  • We found the predictability (F1 score) of an array of ML and Artificial Intelligence (AI) methods on patient genomic data algorithms to be disappointingly poor when we considered multiple types of features including individual alleles, individual genes (S5 Table), and windows of mutations from different genomic regions (Fig 1B)

Read more

Summary

Introduction

Since the completion of the Human Genome Project, progress has been made in understanding the genome, in diseases of the genome such as cancer. Large gaps continue to exist in our knowledge of mutational (genomic) markers vis-à-vis subtle disease subtypes. The primary focus has been on coding genes; the assumed instigators of cancer. Whole exome sequencing’s (WES) intrinsic focus on coding DNA, called the exonic, has naturally reinforced the centrality of coding genes as “cancer drivers” by exclusively discovering coding alterations associated to disease etiology. Classical oncogenetics believe passenger mutations accompany driver mutations throughout the genome but are inconsequential to tumorigenesis [1]. This model has been redefined with the suggestion that these passenger mutations, whether in coding or noncoding DNA, might have a role in cancer progression [2]. The aggregate effect of multiple weak passenger mutations may have a strong influence on tumorigenesis

Methods
Results
Discussion
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.