Abstract
Background: Artificial duplicates among pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads have been filtered out in many metagenomic projects. However, because the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may cause underestimation of the abundance associated with natural duplicates.

Results: We implemented a method for identifying exact and nearly identical duplicates among pyrosequencing reads. The method performs an all-against-all sequence comparison and clusters the duplicates into groups using an algorithm modified from our previous sequence clustering method, cd-hit. It can process a typical dataset in ~10 minutes, and it also provides a consensus sequence for each group of duplicates. We applied the method to the underlying raw reads of 39 genomic projects and 10 metagenomic projects that used pyrosequencing. We compared the occurrences of the duplicates identified by our method with those of the natural duplicates produced by independent simulations. We observed that duplicates, including both artificial and natural duplicates, make up 4-44% of reads. The number of natural duplicates correlates strongly with a sample's read density (number of reads divided by genome size). For high-complexity metagenomic samples lacking dominant species, natural duplicates make up <1% of all duplicates. For some other samples, such as transcriptomic samples, the majority of the observed duplicates might be natural duplicates.

Conclusions: Our method is available from http://cd-hit.org as a downloadable program and a web server. It is important not only to identify the duplicates in metagenomic datasets but also to distinguish whether they are artificial or natural. We provide a tool to estimate the number of natural duplicates according to user-defined sample types, so users can decide whether to retain or remove duplicates in their projects.
Highlights
Artificial duplicates from pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies
We compared the occurrences of the duplicates identified by our method and the natural duplicates made by independent simulations
We studied the pyrosequencing reads of 10 metagenomic datasets (Table 4) from different environments, obtained from the NCBI Short Read Archive (SRA) or from the CAMERA metagenomic project (http://camera.calit2.net).
Summary
Artificial duplicates among pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads have been filtered out in many metagenomic projects. Because the duplicated reads observed in a pyrosequencing run include natural (non-artificial) duplicates, removing all duplicates may cause underestimation of the abundance associated with natural duplicates. 454 sequencers are known to produce artificially duplicated reads, which might lead to misleading conclusions. Exact duplicates were sometimes removed before data analysis [7]. In the study by Gomez-Alvarez et al. [11], nearly identical duplicates, that is, reads that begin at the same position but may vary in length or bear mismatches, were classified as artifacts. Exact and nearly identical duplicates may make up 11-35% of the raw reads.
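The definition of nearly identical duplicates given above (reads that begin at the same position but may vary in length or bear mismatches) can be sketched as a simple greedy clustering. This is an illustrative all-against-all sketch, not the cd-hit-based implementation. The function names, the `max_mismatches` threshold, and the choice to treat the longest read as the group representative are assumptions made for the example.

```python
def is_near_duplicate(rep, read, max_mismatches=1):
    """True if `read` is no longer than `rep` and matches the prefix of `rep`
    with at most `max_mismatches` substitutions (same start, shorter or equal length)."""
    if len(read) > len(rep):
        return False
    mismatches = 0
    for a, b in zip(rep, read):
        if a != b:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def cluster_duplicates(reads, max_mismatches=1):
    """Greedy clustering: process reads longest first; each read joins the first
    representative it duplicates, otherwise it becomes a new representative.

    Returns a list of (representative_index, member_indices) tuples.
    """
    order = sorted(range(len(reads)), key=lambda i: -len(reads[i]))
    clusters = []
    for i in order:
        for rep, members in clusters:
            if is_near_duplicate(reads[rep], reads[i], max_mismatches):
                members.append(i)
                break
        else:
            # No representative matched, so this read starts a new group.
            clusters.append((i, [i]))
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGT", "ACGTTCGTAC", "TTTTGGGGCC", "ACGTACGTAC"]
    for rep, members in cluster_duplicates(reads):
        print(rep, members)
```

The sketch compares every read with every representative, so it scales quadratically in the worst case. It also ignores insertions and deletions, which a practical tool for pyrosequencing reads would need to handle.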