An information-theoretic approach to the modeling and analysis of whole-genome bisulfite sequencing data

Garrett Jenkinson, Andrew P Feinberg, John Goutsias, Jordi Abante

doi:10.1186/s12859-018-2086-5

Open Access

https://doi.org/10.1186/s12859-018-2086-5

Abstract

Background: DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis is employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads.

Results: We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied to single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrates a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data.

Conclusions: This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods.
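The three quantities named in the abstract (a joint 1D Ising distribution, its Shannon entropy, and the Jensen-Shannon distance between two distributions) can be illustrated on a toy region with a handful of CpG sites. The sketch below is not the authors' implementation. It assumes a common Ising parameterization, P(x) ∝ exp(Σ a_n s_n + Σ c_n s_n s_{n+1}) with s_n = 2x_n − 1, and computes the joint PMF by brute-force enumeration, which is only feasible for very small regions. The function names and parameter values are hypothetical, chosen for illustration.

```python
import itertools
import math


def ising_pmf(a, c):
    """Joint PMF over methylation states x in {0,1}^N of a 1D Ising model.

    Assumed form: P(x) ∝ exp(sum_n a[n]*s_n + sum_n c[n]*s_n*s_{n+1}),
    where s_n = 2*x_n - 1. Enumerates all 2^N states (small N only).
    """
    n = len(a)
    states = list(itertools.product((0, 1), repeat=n))
    weights = []
    for x in states:
        s = [2 * xi - 1 for xi in x]
        exponent = sum(a[i] * s[i] for i in range(n))
        exponent += sum(c[i] * s[i] * s[i + 1] for i in range(n - 1))
        weights.append(math.exp(exponent))
    z = sum(weights)  # partition function
    return {x: w / z for x, w in zip(states, weights)}


def shannon_entropy(p):
    """Shannon entropy of a PMF, in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(u):
        # Kullback-Leibler divergence of u from the mixture m.
        return sum(u[k] * math.log2(u[k] / m[k]) for k in u if u[k] > 0)

    return math.sqrt(max(0.5 * kl(p) + 0.5 * kl(q), 0.0))


# Hypothetical "reference" and "test" parameters for a 3-site region.
reference = ising_pmf(a=[1.0, 1.0, 1.0], c=[0.5, 0.5])
test = ising_pmf(a=[-1.0, -1.0, -1.0], c=[0.5, 0.5])
print(shannon_entropy(reference))
print(jensen_shannon_distance(reference, test))
```

Because the joint PMF carries the coupling terms c_n, two regions with identical per-site methylation means can still differ in entropy or Jensen-Shannon distance, which is the kind of information a purely marginal analysis cannot see.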

Highlights

DNA methylation is a stable form of epigenetic memory used by cells to control gene expression
Classification of genomic units: To provide an effective interpretation of the mean methylation level (MML) output, we developed a classification scheme that summarizes the status of the methylation level within a genomic unit (GU) based on the shape of its probability mass function (PMF) (Additional file 1: Section 6.1).
M(k) is the number of available observations within an estimation region Rk, Rk(m) is the set of all CpG sites within Rk whose methylation state is measured in the m-th observation, PX(i)({x(rm), r ∈ Rk(m)} | θ) is the likelihood of the m-th observed sample associated with the i-th model, obtained by marginalizing the entire likelihood PX(i)(x | θ) over the “unmeasured” CpG sites, θi(k) is the maximum-likelihood estimate of the parameters associated with the i-th model, and pi(k) is the corresponding number of free parameters [p1(k) = 5 and p2(k) = 2R(k) − 1, with R(k) being the number of CpG sites in Rk].

Summary

Introduction

DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis is employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. Although several experimental assays have been designed to map DNA methylation marks, WGBS is increasingly becoming the method of choice due to its high quantitative accuracy, resolution, and genome-wide coverage [4]. Extraction of methylation information from bisulfite data has led to many parametric and non-parametric methods for modeling, analysis, and interpretation [4, 5]. Other important methods follow a more direct approach, but they have only been designed to detect differential methylation in data obtained by Illumina’s 450k arrays [18, 19], whose continuous intensity measurements require fundamentally different models and methods than discrete sequencing reads do.

Journal: BMC Bioinformatics	Publication Date: Mar 7, 2018
Citations: 25	License type: open-access
