Abstract

Background: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform.

Results: We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.

Conclusions: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.

Highlights

  • Quality scores and base-calling probabilities: HPCall provides estimated probabilities that a certain homopolymer length (HPL) is present, given the values of all the input variables in the model

  • For a given cycle and nucleotide, the probability of being the correct HPL is estimated for each possible HPL based on different noise predictors, and the call corresponds to the HPL with the maximum probability
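The calling rule in the highlights above can be illustrated with a minimal sketch. It assumes a plain hurdle Poisson distribution with two fixed, made-up parameters (`p_zero` and `lam`), whereas HPCall estimates its probabilities from noise predictors in a weighted regression model that is not reproduced here. The function names, the truncation at `max_hpl`, and the Phred-style transform are illustrative assumptions, not the published implementation.

```python
import math

def hurdle_poisson_pmf(k, p_zero, lam):
    """P(HPL = k): point mass p_zero at 0, zero-truncated Poisson for k >= 1."""
    if k == 0:
        return p_zero
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    return (1.0 - p_zero) * poisson_k / (1.0 - math.exp(-lam))

def call_hpl(p_zero, lam, max_hpl=10):
    """Return (called HPL, probabilities for HPL 0..max_hpl)."""
    probs = [hurdle_poisson_pmf(k, p_zero, lam) for k in range(max_hpl + 1)]
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalise over the truncated support
    call = max(range(len(probs)), key=probs.__getitem__)  # maximum-probability HPL
    return call, probs

def indel_probs(call, probs):
    """Probability mass below and above the call.

    Mass below the call means the call may be an overcall (insertion);
    mass above it means the call may be an undercall (deletion).
    """
    return sum(probs[:call]), sum(probs[call + 1:])

def phred(p_correct, cap=60.0):
    """Phred-style score from the probability that the call is correct (assumed transform)."""
    p_error = 1.0 - p_correct
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    call, probs = call_hpl(p_zero=0.05, lam=2.5)
    p_over, p_under = indel_probs(call, probs)
    print("called HPL:", call)
    print("quality score:", round(phred(probs[call]), 2))
    print("P(overcall):", round(p_over, 3), "P(undercall):", round(p_under, 3))
```

Splitting the remaining probability mass into the part below and the part above the call is what makes such a score informative about whether an insertion or a deletion error is the more likely one.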


Summary

Introduction

454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A first step in the analysis of next-generation sequencing (NGS) data is the transformation of the measured intensity signals to a sequence of nucleotides. This process, referred to as base-calling, is an important task, as systematic base-calling errors may mislead downstream analysis [1], e.g. in genome assembly and sequence mapping. More accurate base-calling and more reliable base-calling quality scores result in a better distinction between sequencing errors and true polymorphisms between the base-called reads and a reference sequence. This is essential for the detection of single nucleotide polymorphisms (SNPs) or sequence variants [2,3,4]. With 454 data, it becomes harder to discriminate between subsequent homopolymer lengths as the homopolymer length (HPL) increases, resulting in an inflation of undercalls or overcalls (Additional file 1: Figure S1).
