Abstract

Background: Chromosomal rearrangements are typical of cancer genomes and cause gene disruptions and fusions, corruption of regulatory elements, and damage to chromosome integrity. Among the factors contributing to genomic instability are non-B DNA structures, with stem-loops and quadruplexes being the most prevalent. We aimed to investigate the impact of these two classes of non-B DNA structures on cancer breakpoint hotspots using a machine learning approach.

Methods: Because the data are extremely imbalanced and a reliable estimate of predictive power was required, we developed a procedure for building and evaluating machine learning models. We built logistic regression models predicting cancer breakpoint hotspots from the densities of stem-loops and quadruplexes, jointly and separately. We also tested Random Forest models under different resampling schemes (leave-one-out cross-validation, train-test split, 3-fold cross-validation) and class balancing techniques (oversampling, stratification, synthetic minority oversampling).

Results: We analyzed 487,425 breakpoints from 2234 samples covering 10 cancer types available from the International Cancer Genome Consortium. We showed that the distributions of breakpoint hotspots in different types of cancer are not correlated, confirming the heterogeneous nature of cancer. The stem-loop-based model best explains the blood, brain, liver, and prostate cancer breakpoint hotspot profiles, while the quadruplex-based model performs better for bone, breast, ovary, pancreatic, and skin cancer. For the overall cancer profile and for uterus cancer, the joint model shows the highest performance. For particular datasets the constructed models reach high predictive power using just one predictor, and in the majority of cases the model built on both predictors does not improve performance.

Conclusion: Despite the heterogeneity in the distribution of breakpoint hotspots across cancer types, our results demonstrate an association between cancer breakpoint hotspots and both stem-loops and quadruplexes. For approximately half of the cancer types stem-loops are the most influential factor, while for the others it is quadruplexes. This reflects differences in the regulatory potential of stem-loops and quadruplexes at the tissue-specific level, which are yet to be discovered at the genome-wide scale. The analysis demonstrates that the influence of stem-loops and quadruplexes on breakpoint hotspot formation is tissue-specific.
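The evaluation idea described in the Methods (stratified 3-fold cross-validation, class balancing by oversampling, and logistic regression on one or both densities) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data below are synthetic, all function names are hypothetical, plain random oversampling stands in for the balancing techniques, and the logistic regression is a small gradient-descent implementation written in pure Python so that the example has no dependencies. Oversampling is applied to the training folds only, so the held-out fold keeps its original class ratio.

```python
import math
import random
from collections import defaultdict

def stratified_folds(labels, k=3, seed=0):
    """Split indices into k folds with roughly equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def oversample(train_idx, labels, seed=0):
    """Duplicate random minority-class indices until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(train_idx)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cv_auc(X, y, cols, k=3):
    """Mean held-out ROC AUC of a model that uses only the feature columns in cols."""
    aucs = []
    for held_out in stratified_folds(y, k):
        test = set(held_out)
        train = oversample([i for i in range(len(y)) if i not in test], y)
        w, b = fit_logistic([[X[i][c] for c in cols] for i in train],
                            [y[i] for i in train])
        scores = [sigmoid(b + sum(wj * X[i][c] for wj, c in zip(w, cols)))
                  for i in held_out]
        aucs.append(roc_auc(scores, [y[i] for i in held_out]))
    return sum(aucs) / k

# Synthetic toy data (an assumption, not ICGC data): 300 genomic windows,
# 12 of them hotspots. Column 0 mimics stem-loop density and is made
# informative by construction; column 1 mimics quadruplex density and is noise.
rng = random.Random(42)
y = [1] * 12 + [0] * 288
X = [[rng.gauss(0.7 if label else 0.3, 0.1), rng.gauss(0.5, 0.2)] for label in y]

for name, cols in [("stem-loops", [0]), ("quadruplexes", [1]), ("joint", [0, 1])]:
    print(name, round(cv_auc(X, y, cols), 3))
```

Comparing the three printed values mirrors the separate-versus-joint comparison in the abstract: when one predictor carries most of the signal, adding the second need not improve the held-out score.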

Highlights

  • Breakpoint hotspots: data on cancer breakpoints were downloaded from the International Cancer Genome Consortium (ICGC) Data Portal


Summary

Introduction

Chromosomal rearrangements are typical of cancer genomes and cause gene disruptions and fusions, corruption of regulatory elements, and damage to chromosome integrity. Analysis of almost 700,000 somatic copy-number variant breakpoints from around 2800 cancer genomes demonstrated an enrichment of quadruplexes and of hypomethylated DNA regions in the vicinity of cancer breakpoints [8]. Epigenetic features of a particular cancer type, such as chromatin accessibility and histone modifications, together with replication timing explain up to 86% of the variance in single mutation densities for that cancer type [9]. Analysis of the association between cancer somatic mutations and different non-B DNA structures, including G-quadruplexes (G4), H-DNA, Z-DNA and direct, inverted, mirror and short tandem repeats, revealed a two-fold enrichment of non-B motifs in mutation regions and demonstrated that machine-learning models built on the densities of non-B motifs and epigenetic factors, taken either separately or jointly, are able to predict the densities of somatic mutations [10]. Machine-learning models using epigenomic and chromatin context reached good accuracy in predicting double-strand breaks (DSBs) at 1 kb resolution, with chromatin accessibility, activity, and long-range contacts being the best predictors [11].
