Abstract

High-throughput bisulfite sequencing technologies have provided a comprehensive and well-suited way to investigate DNA methylation at single-base resolution. However, precisely distinguishing methylcytosines from unconverted cytosines in bisulfite sequencing data poses substantial bioinformatic challenges. These challenges arise, at least in part, from cell heterozygosis caused by multicellular sequencing and from the still limited number of statistical methods available for methylcytosine calling from bisulfite sequencing data. Here, we present Bycom, an algorithm based on a new Bayesian model that performs methylcytosine calling with high accuracy. Bycom considers cell heterozygosis along with sequencing errors and bisulfite conversion efficiency to improve calling accuracy. We compared the performance of Bycom with that of Lister, the method most widely used to identify methylcytosines from bisulfite sequencing data. Bycom performed better than Lister on data with high methylation levels, and it also showed higher sensitivity and specificity than Lister on samples with low methylation levels (<1%). A validation experiment based on reduced representation bisulfite sequencing data suggested that Bycom had a false positive rate of about 4% while maintaining an accuracy of close to 94%. This study demonstrated that Bycom had a low false calling rate at any methylation level and called methylcytosines accurately at high methylation levels. Bycom will contribute significantly to studies aimed at recalibrating the methylation level of genomic regions based on the presence of methylcytosines.

Highlights

  • DNA methylation is an important epigenetic modification involved in the regulation of gene expression and plays critical roles in cellular processes [1,2,3,4,5]

  • We present a novel computational strategy, Bycom, for precisely identifying methylcytosines from bisulfite sequencing (BS-seq) data using a Bayesian inference model

  • Bycom considers the impacts of sequencing errors and non-conversion rate, both of which are treated as the false-positive rate in the Lister method, and introduces cell heterozygosis to identify methylcytosines in an unbiased manner


Summary

Introduction

DNA methylation is an important epigenetic modification that is involved in the regulation of gene expression and plays critical roles in cellular processes [1,2,3,4,5]. The large data sets generated by bisulfite sequencing (BS-seq) pose data processing challenges for methylcytosine calling. The first step of methylation analysis with BS-seq data is to map the bisulfite-converted reads to a reference genome using software such as SOAP and BSMAP [12,13,14]. Methylcytosines can then be identified from the reads aligned to the cytosines on the reference genome. Besides sequencing errors, methylcytosine calling is affected by incomplete bisulfite conversion, measured as the fraction of unmethylated cytosines that were not converted to thymines by the bisulfite treatment. Cell heterozygosis caused by multicellular sequencing can also influence the precision of methylcytosine detection, because the methylation status of the same cytosine site probably differs between cells owing to the coexistence of methylation and demethylation [15,16,17,18].
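
To make the role of these error sources concrete, the following is a minimal sketch of a per-site binomial test in which sequencing error and non-conversion are pooled into a single false-positive rate, which is how the highlights describe the Lister method. It is not the Bycom model and not a reproduction of any published implementation. The function names, the default error rates, and the significance threshold `alpha` are illustrative assumptions, and no multiple-testing correction is applied.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_methylcytosine(
    c_reads: int,
    total_reads: int,
    seq_error: float = 0.001,       # assumed sequencing error rate (illustrative)
    non_conversion: float = 0.005,  # assumed bisulfite non-conversion rate (illustrative)
    alpha: float = 0.01,            # assumed significance threshold (illustrative)
) -> tuple[bool, float]:
    """Test whether the reads showing C at a site exceed what errors alone explain.

    c_reads     -- aligned reads that still show a cytosine at the site
    total_reads -- all aligned reads covering the site
    Returns (is_called_methylated, p_value).
    """
    # Both error sources are pooled into one false-positive rate.
    false_positive_rate = seq_error + non_conversion
    # Probability of seeing at least c_reads cytosines if the site were unmethylated.
    p_value = binomial_tail(c_reads, total_reads, false_positive_rate)
    return p_value < alpha, p_value


if __name__ == "__main__":
    for c, n in [(5, 20), (1, 200), (0, 20)]:
        called, p = call_methylcytosine(c, n)
        print(f"C reads={c:3d} depth={n:3d} p-value={p:.3g} called={called}")
```

Because this test treats every site as either fully methylated or fully unmethylated, it has no term for a site that is methylated in only some of the sequenced cells, which is the cell heterozygosis effect described above.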
