Abstract
Background: In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and then sequenced. The next generation sequencing technology, 454 pyrosequencing, has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, there is a growing realisation that, because of the large read numbers and the lack of consensus sequences, it is vital to distinguish noise from true sequence diversity in this data. Failing to do so leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. Three sources of error are important: sequencing error, PCR single base substitutions and PCR chimeras. We present AmpliconNoise, a development of the PyroNoise algorithm that is capable of separately removing 454 sequencing errors and PCR single base errors. We also introduce a novel chimera removal program, Perseus, that exploits the sequence abundances associated with pyrosequencing data. We use data sets where samples of known diversity have been amplified and sequenced to quantify the effect of each of the sources of error on OTU inflation and to validate these algorithms.
Results: AmpliconNoise outperforms alternative algorithms, substantially reducing per base error rates for both the GS FLX and latest Titanium protocol. All three sources of error lead to inflation of diversity estimates. In particular, chimera formation has a hitherto unrealised importance which varies according to amplification protocol. We show that AmpliconNoise allows accurate estimates of OTU number. Just as importantly, AmpliconNoise generates the right OTUs even at low sequence differences. We demonstrate that Perseus has very high sensitivity, able to find 99% of chimeras, which is critical when these are present at high frequencies.
Conclusions: AmpliconNoise followed by Perseus is a very effective pipeline for the removal of noise. In addition, the principles behind the algorithms, the inference of true sequences using Expectation-Maximization (EM) and the treatment of chimera detection as a classification or 'supervised learning' problem, will be equally applicable to new sequencing technologies as they appear.
Highlights
In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and sequenced
We truncated the reads at 220 bp and 400 bp for GS FLX and Titanium respectively before calculating exact pairwise sequence distances for the single-linkage preclustering (SLP) algorithm
For SLP we used the same filtered reads as for AmpliconNoise, but this was not possible for the DeNoiser, since its filtering is done through the QIIME pipeline [20]
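The truncation and preclustering step mentioned in the highlights can be illustrated with a minimal Python sketch. It is not the SLP implementation: the function names, the toy threshold, and the use of a plain mismatch fraction in place of exact alignment-based pairwise distances are all assumptions made for illustration. Reads shorter than the truncation length are simply dropped here, which is also an assumption.

```python
from itertools import combinations


def truncate_reads(reads, length):
    """Keep reads with at least `length` bases and cut them to exactly `length`.

    Assumption: shorter reads are discarded rather than padded.
    """
    return [r[:length] for r in reads if len(r) >= length]


def mismatch_fraction(a, b):
    """Fraction of positions that differ between two equal-length reads.

    Simplification: no alignment is performed, so indels are not handled.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clusters(seqs, threshold):
    """Single-linkage grouping: two reads share a cluster if a chain of
    pairwise distances, each at most `threshold`, connects them."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if mismatch_fraction(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(seqs[i])
    return list(clusters.values())
```

With real data, `length` would be 220 or 400 as stated above, and the distance function would be replaced by an exact pairwise alignment distance. Comparing every pair is quadratic in the number of reads, so this sketch is only practical for small inputs.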
Summary
In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and sequenced. The next generation sequencing technology, 454 pyrosequencing, has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. One technology that is finding many applications, for example in de novo genome sequencing or in diversity studies of regions of DNA that have been amplified by PCR, is 454 pyrosequencing [1]. It is this latter application of 454 to the sequencing of PCR products, or amplicons, that we will focus on here. This has many applications, for instance in viral population dynamics [2] or in characterising microbial communities through amplification of 16S rRNA genes [3].