Abstract
Motivation: Reliable estimation of the mean fragment length for next-generation short-read sequencing data is an important step in next-generation sequencing analysis pipelines, most notably because of its impact on the accuracy of the enriched regions identified by peak-calling algorithms. Although many peak-calling algorithms include a fragment-length estimation subroutine, the problem has not been adequately solved, as demonstrated by the variability of the estimates returned by different algorithms.
Results: In this article, we investigate the use of strand cross-correlation to estimate mean fragment length of single-end data and show that traditional estimation approaches have mixed reliability. We observe that the mappability of different parts of the genome can introduce an artificial bias into cross-correlation computations, resulting in incorrect fragment-length estimates. We propose a new approach, called mappability-sensitive cross-correlation (MaSC), which removes this bias and allows for accurate and reliable fragment-length estimation. We analyze the computational complexity of this approach, and evaluate its performance on a test suite of NGS datasets, demonstrating its superiority to traditional cross-correlation analysis.
Availability: An open-source Perl implementation of our approach is available at http://www.perkinslab.ca/Software.html.
Contact: tperkins@ohri.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
Next-generation sequencing (NGS) technologies have revolutionized molecular biology with their unprecedented capacity for genome-wide measurement of protein–DNA interactions, chromatin state changes and transcription levels (Mardis, 2011)
For example, a DNA sample that is the result of a chromatin-immunoprecipitation experiment, in which DNA bound to a particular transcription factor (TF) is pulled down
We have demonstrated that mappability can introduce a strong bias into genome-wide cross-correlation computations of positive- and negative-strand read densities
Summary
Next-generation sequencing (NGS) technologies have revolutionized molecular biology with their unprecedented capacity for genome-wide measurement of protein–DNA interactions, chromatin state changes and transcription levels (Mardis, 2011). Although NGS technologies differ in their details, most of the common platforms work by sequencing large numbers of short DNA fragments. These fragments may originate, for example, from simple extraction of DNA from a sample of cells, selective extraction based on a chromatin-immunoprecipitation pulldown or reverse transcription of RNA into DNA. When the organism does have a canonical genome, the DNA fragment sequences are typically mapped back to the canonical genome, so that their distribution, and especially sites of enrichment, may be studied (Pepke et al., 2009). The best practical alternative offered by typical current technologies is sequencing the fragments starting from both ends. Despite having a canonical genome assembly to which one end of each fragment can be mapped, most NGS experiments lack information on the other, unsequenced end of each fragment.
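Because the unsequenced end of each fragment is unknown, the mean fragment length has to be inferred from the mapped reads, and the abstract names strand cross-correlation as the tool for doing so. The Python sketch below illustrates the general idea only; it is not the authors' Perl implementation, and the function names, the per-base count arrays and the boolean mappability masks are all assumptions introduced here. It correlates forward-strand read-start counts at position i with reverse-strand counts at position i + d for each shift d, and takes the best-scoring shift as the estimate. When masks are supplied, only position pairs that are mappable on both strands contribute, which is one plausible reading of "mappability-sensitive".

```python
import numpy as np


def strand_cross_correlation(fwd, rev, max_shift, fwd_map=None, rev_map=None):
    """Pearson correlation between forward counts at i and reverse counts at i + d.

    fwd, rev: per-base read-start counts on each strand (hypothetical inputs).
    fwd_map, rev_map: optional boolean mappability masks (an assumption of this
    sketch). If given, a pair (i, i + d) is used only when position i is
    mappable on the forward strand and i + d is mappable on the reverse strand.
    Returns an array of length max_shift + 1; undefined shifts are NaN.
    """
    fwd = np.asarray(fwd, dtype=float)
    rev = np.asarray(rev, dtype=float)
    n = len(fwd)
    if fwd_map is None:
        fwd_map = np.ones(n, dtype=bool)
    if rev_map is None:
        rev_map = np.ones(n, dtype=bool)

    scores = np.full(max_shift + 1, np.nan)
    for d in range(max_shift + 1):
        # Keep only pairs where both ends of the shifted pair are mappable.
        keep = fwd_map[: n - d] & rev_map[d:]
        x = fwd[: n - d][keep]
        y = rev[d:][keep]
        # Correlation is undefined for too few points or constant input.
        if x.size < 2 or x.std() == 0 or y.std() == 0:
            continue
        scores[d] = np.corrcoef(x, y)[0, 1]
    return scores


def estimate_shift(scores):
    """Shift with the highest correlation (raises ValueError if all are NaN)."""
    return int(np.nanargmax(scores))
```

Calling `strand_cross_correlation(fwd, rev, max_shift)` without masks gives the plain, unmasked computation; passing the two masks restricts every shift to pairs that are mappable on both strands. How the chosen shift relates to the fragment length depends on the coordinate convention used for reverse-strand reads, so an off-by-one adjustment may be needed in practice.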