A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions

Thomas Vanhaeren, Miguel García-Torres, Pedro Manuel Martínez-García, Francisco Gómez-Vela, Federico Divina, Wim Vanhoof

doi:10.3390/genes11090985

Abstract

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental chromatin interaction data are not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best suited algorithm for this task and highlights cell-type specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.
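The comparison described in the abstract amounts to scoring several classifiers on the same held-out data. The following is a minimal sketch of such a harness, not the authors' pipeline: the use of stratified k-fold cross-validation is an assumption, the two learners (a majority-class baseline and a nearest-centroid rule) are simple stand-ins for the algorithms named above, and the data are synthetic rather than chromatin features.

```python
import random
from statistics import mean


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy; fit(X_train, y_train) must return a predict(x) function."""
    scores = []
    for held_out in stratified_folds(y, k, seed):
        test = set(held_out)
        train = [i for i in range(len(y)) if i not in test]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return mean(scores)


def fit_majority(X_train, y_train):
    """Baseline: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


def fit_nearest_centroid(X_train, y_train):
    """Stand-in learner: assign a sample to the class with the closest centroid."""
    centroids = {}
    for cls in set(y_train):
        rows = [x for x, label in zip(X_train, y_train) if label == cls]
        centroids[cls] = [mean(col) for col in zip(*rows)]

    def predict(x):
        def sq_dist(cls):
            return sum((a - b) ** 2 for a, b in zip(x, centroids[cls]))
        return min(centroids, key=sq_dist)

    return predict


# Synthetic, well-separated toy data (NOT chromatin data): 50 samples per class.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

learners = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}
results = {name: cross_val_accuracy(fit, X, y) for name, fit in learners.items()}
```

Because every learner is scored on identical folds, differences in `results` reflect the learners rather than the split. Any function with the same `fit` signature could be plugged into `learners` in place of the stand-ins.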

Highlights

Mammalian genomes stretch for more than 2 meters and are formed by around 3 billion base pairs that are tightly packed within the nucleus, which has a width on the order of micrometers. Strikingly, this level of compaction is compatible with proper accessibility to the cellular machinery required for essential metabolic processes like replication or transcription.
We model chromatin loops using an integrative approach based on ENCODE 1D sequencing datasets to test the performance of five different machine learning algorithms: decision trees (DT), random forests (RF), gradient boosting (XGBoost), support vector machines (SVMs) and multi-layer perceptron (MLP).
We used published ChIA-PET datasets from two human cancer cell lines (Table S1), K562 and GM12878, identifying 3,290 and 5,486 RAD21-mediated chromatin loops, respectively. 1D sequencing datasets from ENCODE (ENCODE Project Consortium, 2012) were collected in order to represent the chromatin features associated with loops and their genomic neighbourhoods (Figure 1, Figure S1).
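To illustrate how 1D signals at loop anchors and in their neighbourhoods could be turned into model inputs, here is a minimal sketch. It is an assumption-laden illustration rather than the paper's feature pipeline: the `Interval` and `Peak` types, the `loop_features` function, the 1,000 bp `flank`, the summed-signal aggregation and the toy peak values are all hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


class Peak(NamedTuple):
    region: Interval
    signal: float


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def region_signal(region, peaks):
    """Sum the signal of all peaks overlapping the region (assumed aggregation)."""
    return sum(p.signal for p in peaks if overlaps(region, p.region))


def loop_features(left, right, tracks, flank=1000):
    """Build a {feature_name: value} dict for one candidate loop.

    tracks maps a dataset name (e.g. a ChIP-seq target) to its list of peaks.
    flank is a hypothetical neighbourhood size added on both sides of each anchor.
    Assumes both anchors lie on the same chromosome, with left upstream of right.
    """
    features = {}
    for side, anchor in (("left", left), ("right", right)):
        window = Interval(anchor.chrom, max(0, anchor.start - flank), anchor.end + flank)
        for name, peaks in tracks.items():
            features[f"{name}_{side}_anchor"] = region_signal(anchor, peaks)
            features[f"{name}_{side}_window"] = region_signal(window, peaks)
    features["loop_span"] = right.start - left.end  # distance between the anchors
    return features


# Toy peaks with made-up coordinates and signal values.
tracks = {
    "CTCF": [
        Peak(Interval("chr1", 1000, 1200), 8.0),
        Peak(Interval("chr1", 2000, 2100), 3.0),
        Peak(Interval("chr1", 50000, 50300), 5.0),
    ],
    "RAD21": [Peak(Interval("chr1", 900, 1100), 4.0)],
}
left = Interval("chr1", 950, 1500)
right = Interval("chr1", 50100, 50600)
features = loop_features(left, right, tracks)
```

Each candidate loop becomes one row of named features, with separate columns for the anchor itself and for its flanking window, so a downstream classifier can weigh anchor-proximal signal independently of the wider neighbourhood.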

Summary

Introduction

Mammalian genomes stretch for more than 2 meters and are formed by around 3 billion base pairs that are tightly packed within the nucleus, which has a width on the order of micrometers. This level of compaction is compatible with proper accessibility to the cellular machinery required for essential metabolic processes like replication or transcription. Beyond nucleosome-nucleosome interactions, chromatin loops represent the smallest scale of genome organization. The interactions of TADs with one another make up megabase-scale structures that extend to whole chromosomes and are known as nuclear compartments [7]. Since there is not yet a well-characterized biological delineation between such orders of genome organization, in this study we will use the term ’loop’ to refer indistinctly to chromatin interactions at any level of this hierarchy.


Journal: Genes	Publication Date: Aug 24, 2020
Citations: 10	License type: CC BY 4.0
