Abstract

Background: Amplicon pyrosequencing targets a known genetic region and thus inherently produces reads highly anticipated to have certain features, such as conserved nucleotide sequence, and in the case of protein coding DNA, an open reading frame. Pyrosequencing errors, consisting mainly of nucleotide insertions and deletions, are on the other hand likely to disrupt open reading frames. Such an inverse relationship between errors and expectation based on prior knowledge can be used advantageously to guide the process known as basecalling, i.e. the inference of nucleotide sequence from raw sequencing data.

Results: The new basecalling method described here, named Multipass, implements a probabilistic framework for working with the raw flowgrams obtained by pyrosequencing. For each sequence variant Multipass calculates the likelihood and nucleotide sequence of several most likely sequences given the flowgram data. This probabilistic approach enables integration of basecalling into a larger model where other parameters can be incorporated, such as the likelihood for observing a full-length open reading frame at the targeted region. We apply the method to 454 amplicon pyrosequencing data obtained from a malaria virulence gene family, where Multipass generates 20 % more error-free sequences than current state of the art methods, and provides sequence characteristics that allow generation of a set of high confidence error-free sequences.

Conclusions: This novel method can be used to increase accuracy of existing and future amplicon sequencing data, particularly where extensive prior knowledge is available about the obtained sequences, for example in analysis of the immunoglobulin VDJ region where Multipass can be combined with a model for the known recombining germline genes. Multipass is available for Roche 454 data at http://www.cbs.dtu.dk/services/MultiPass-1.0, and the concept can potentially be implemented for other sequencing technologies as well.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1032-7) contains supplementary material, which is available to authorized users.
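
The abstract describes combining the flowgram likelihood of each candidate sequence with the likelihood of observing a full-length open reading frame. The sketch below illustrates that general idea only and is not the Multipass model. The function names, the prior probabilities (0.95 and 0.05), the candidate sequences and their log-likelihoods are all hypothetical, and the reading frame is assumed to start at the first base of the amplicon.

```python
import math

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_reading_frame(seq):
    """Return True if seq is a whole number of codons with no stop codon.

    Assumption: the reading frame starts at the first base of the amplicon.
    """
    if len(seq) % 3 != 0:
        return False
    return all(seq[i:i + 3] not in STOP_CODONS for i in range(0, len(seq), 3))


def rerank(candidates, p_orf=0.95):
    """Add a log prior for an intact reading frame to each log-likelihood.

    candidates: list of (sequence, log_likelihood_given_flowgram) pairs.
    p_orf: hypothetical prior probability that a true read has an intact ORF.
    Returns (score, sequence) pairs sorted from best to worst.
    """
    scored = []
    for seq, log_lik in candidates:
        prior = p_orf if has_open_reading_frame(seq) else 1.0 - p_orf
        scored.append((log_lik + math.log(prior), seq))
    scored.sort(reverse=True)
    return scored


# Made-up candidates: the first fits the flowgram slightly better but
# carries an extra C that shifts the frame; the second keeps the frame intact.
candidates = [("ATGGCCCTGA", -1.0), ("ATGGCCTGG", -1.5)]
ranking = rerank(candidates)
best_sequence = ranking[0][1]
print(best_sequence)
```

With these illustrative numbers the in-frame candidate is ranked first even though its flowgram likelihood is lower, which is the kind of trade-off a combined model allows.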

Highlights

  • Amplicon pyrosequencing targets a known genetic region and inherently produces reads highly anticipated to have certain features, such as conserved nucleotide sequence, and in the case of protein coding DNA, an open reading frame

  • Forward and reverse primers were designed by adding GS FLX Titanium Primer sequence and 10 bp multiplex identifier (MID) tags published by regions using GS FLX Titanium chemistry (Roche) (Roche 454 Sequencing Technical Bulletin No 013-2009; 454 Sequencing Technical Bulletin No 005-2009)

  • When we performed maximum likelihood fitting of normal distributions to flow signal distributions from homopolymers longer than 5 nucleotides present in the Plasmodium falciparum (Pf) DBLα-tag reference sequences, we found that these homopolymers gave lower flow signals than expected from the Balzer extrapolation, and that normal distributions shifted towards zero described the 454 data more accurately (Additional file 1: Figure S1)
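
For a normal distribution, the maximum likelihood estimates mentioned in the last highlight have a closed form: the sample mean, and the standard deviation computed with a divisor of n. The sketch below shows this under stated assumptions. The flow values are invented for illustration and are not taken from the paper's data, and `fit_normal_mle` is a hypothetical helper name.

```python
import math


def fit_normal_mle(signals):
    """Maximum likelihood fit of a normal distribution to flow signals.

    The MLE of the mean is the sample mean; the MLE of the variance
    divides by n (not n - 1).
    """
    n = len(signals)
    mean = sum(signals) / n
    variance = sum((x - mean) ** 2 for x in signals) / n
    return mean, math.sqrt(variance)


# Invented flow signals observed for a homopolymer of true length 6.
signals_hp6 = [5.6, 5.9, 5.4, 5.8, 5.3]
mu, sigma = fit_normal_mle(signals_hp6)
print(f"mean={mu:.3f} sd={sigma:.3f}")
```

A fitted mean below the true homopolymer length, as in this invented sample, corresponds to the shift towards zero described in the highlight.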


Summary

Introduction

Amplicon pyrosequencing targets a known genetic region and inherently produces reads highly anticipated to have certain features, such as a conserved nucleotide sequence and, in the case of protein-coding DNA, an open reading frame. Pyrosequencing errors, which consist mainly of nucleotide insertions and deletions, are on the other hand likely to disrupt open reading frames. This inverse relationship between errors and expectation based on prior knowledge can be used to guide basecalling, i.e. the inference of nucleotide sequence from raw sequencing data. 454 pyrosequencing is distinguished from other available high-throughput methods by its long read length and by its main error type, insertions and deletions (indels), which occur at a rate of around 1 % [4]. During sequencing, the PicoTiterPlate (PTP) is exposed to flows of PCR reagents across the open wells, with one nucleotide type at a time. Each read is given as a sequence of 800 preprocessed light intensities, known as a flowgram, where each flow value indicates the length of a homopolymer (HP) in the read.
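
To make the flowgram representation concrete, the sketch below performs the simplest possible basecalling: each flow value is rounded to the nearest integer and read as a homopolymer length. This is a naive baseline for illustration, not the Multipass method. It assumes the cyclic flow order T, A, C, G; the function name and the flow values are invented.

```python
# Assumed cyclic nucleotide flow order for 454 data.
FLOW_ORDER = "TACG"


def naive_basecall(flowgram, flow_order=FLOW_ORDER):
    """Round each flow value to the nearest homopolymer length.

    A flow value near 0 means the flowed nucleotide was not incorporated;
    a value near k means a homopolymer of length k.
    """
    bases = []
    for i, signal in enumerate(flowgram):
        nucleotide = flow_order[i % len(flow_order)]
        hp_length = max(0, int(signal + 0.5))  # round half up, never negative
        bases.append(nucleotide * hp_length)
    return "".join(bases)


# Invented flowgram covering two cycles of the flow order.
flowgram = [1.02, 0.05, 2.10, 0.97, 0.03, 1.04, 0.00, 2.95]
sequence = naive_basecall(flowgram)
print(sequence)

# Values close to a half-integer are where rounding produces indels:
# 1.49 is called as one base, 1.51 as two.
ambiguous = naive_basecall([1.49, 1.51])
```

Because a small shift in a single flow value changes the called homopolymer length by one base, errors from this kind of rounding appear as insertions or deletions, which is why they tend to disrupt an open reading frame.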

