Abstract

Background: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as using the Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them.

Results: Our comparisons on seven reference datasets of histone modifications (H3K36me3 and H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model, combined with alternative noise assumptions and supervised learning of the penalty parameter, reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS (https://github.com/aLiehrmann/CROCS), detect the peaks more accurately than algorithms that rely on natural assumptions.

Conclusion: The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.
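As a rough illustration of what over-dispersion means for count data, the sketch below computes the ratio of sample variance to mean. Under a Poisson model this ratio is close to 1 for counts that share a single mean, so a much larger value hints that the Poisson assumption is too restrictive. The function name `dispersion_index` and the example counts are made up for illustration and do not come from the paper or from the CROCS package.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return sample variance divided by mean for a list of counts.

    Intended for counts assumed to share one underlying mean (for example,
    a background region with no peak). Values well above 1 suggest
    over-dispersion relative to a Poisson model.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is 0")
    return variance(counts) / m


# Hypothetical counts from a region assumed to have a constant mean.
mild = [2, 2, 2, 2, 4, 0]
bursty = [0, 0, 9, 0, 1, 14, 0, 0]

print(dispersion_index(mild))
print(dispersion_index(bursty))
```

The index is only meaningful within a region of constant mean: a coverage profile that contains both background and peaks has a large variance simply because the signal changes level.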

Highlights

  • Histone modification constitutes a basic mechanism for the genetic regulation of gene expression

  • We have shown that the over-dispersion exhibited by the count data can be effectively reduced in these datasets using either a negative binomial or a Gaussian transformed noise model

  • We developed the CROCS algorithm that computes all optimal models between two peak bounds, given any segmentation algorithm with constant penalty for each changepoint
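To make the idea of a segmentation algorithm with a constant penalty per changepoint more concrete, here is a minimal sketch of penalized changepoint detection by optimal partitioning with a Poisson loss. It is a generic textbook-style dynamic program, not the CROCS algorithm and not the models evaluated in the paper. The function names and the toy counts are assumptions chosen for illustration.

```python
import math


def poisson_cost(cum, start, end):
    """Poisson negative log-likelihood of counts[start:end] at its fitted mean.

    Terms that do not depend on the segmentation (log y!) are dropped.
    `cum` is the cumulative sum of the counts, with cum[0] == 0.
    """
    total = cum[end] - cum[start]
    if total == 0:
        return 0.0
    return total - total * math.log(total / (end - start))


def optimal_partitioning(counts, penalty):
    """Return the sorted changepoint positions minimizing cost + penalty per changepoint."""
    n = len(counts)
    cum = [0]
    for c in counts:
        cum.append(cum[-1] + c)

    # best[t] is the minimal penalized cost of counts[:t]; starting at
    # -penalty means the first segment does not pay for a changepoint.
    best = [-penalty] + [math.inf] * n
    last = [0] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            cost = best[s] + penalty + poisson_cost(cum, s, t)
            if cost < best[t]:
                best[t] = cost
                last[t] = s

    # Walk back through the stored segment starts to recover changepoints.
    changepoints = []
    t = n
    while t > 0:
        t = last[t]
        if t > 0:
            changepoints.append(t)
    return sorted(changepoints)


# Toy profile: background, a peak, then background again.
counts = [1, 0, 2, 1, 1, 20, 22, 19, 21, 20, 1, 2, 0, 1, 1]
print(optimal_partitioning(counts, penalty=5.0))
print(optimal_partitioning(counts, penalty=1000.0))
```

A small penalty recovers the start and end of the toy peak, while a very large one returns no changepoint at all. This is why the choice of penalty matters and why the highlights speak of exploring the models obtained across penalty values.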


Summary

Introduction

Liehrmann et al. BMC Bioinformatics (2021) 22:323

Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is amongst the most widely used methods in molecular biology [15]. This method aims to identify transcription factor binding sites [20, 22] or post-translational histone modifications [24, 25], referred to as histone marks, underlying regulatory elements. The ChIP-seq assay yields a set of DNA sequence reads, which are aligned to a reference genome and counted at each genomic position. The binding sites or histone marks of interest appear as regions of high read density, referred to as peaks, in the coverage profile. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as using the Poisson distribution to model the noise in the count data.
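The read-counting step described above can be sketched as follows. This is a simplified illustration that assumes each aligned read is given as a half-open interval `(start, end)` on a single chromosome. The function name `coverage_profile` and the example reads are hypothetical, and real pipelines work from alignment files rather than Python tuples.

```python
def coverage_profile(reads, genome_length):
    """Count how many reads overlap each position of a reference sequence.

    `reads` is an iterable of (start, end) pairs, 0-based and half-open,
    with 0 <= start < end <= genome_length.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the differences gives the depth at each position.
    coverage = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Three hypothetical reads piling up around position 2.
reads = [(0, 3), (2, 5), (2, 4)]
print(coverage_profile(reads, genome_length=6))  # [1, 1, 3, 2, 1, 0]
```

Positions where many reads overlap show up as high values in the returned list, which is the kind of signal a peak detection method then has to separate from background noise.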

