Abstract

We developed a fast method to construct local sub-databases from the NCBI-nr database for quick similarity search and annotation of huge metagenomic datasets based on the BLAST-MEGAN approach. A three-step sub-database annotation pipeline (SAP) was further proposed to conduct the annotation in a much more time-efficient way, requiring far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. In the 1st BLAST of SAP, the original metagenomic dataset was searched against the constructed sub-database to quickly screen for candidate target sequences. The candidate target sequences identified in the 1st BLAST were then subjected to a 2nd BLAST against the whole NCBI-nr database. Finally, the BLAST results were annotated using MEGAN to filter out sequences mistakenly selected in the 1st BLAST and so guarantee the accuracy of the results. Based on the tests conducted in this study, SAP achieved a speedup of ∼150–385 times at a BLAST e-value of 1e–5, compared to the direct BLAST against the NCBI-nr database. The annotation results of SAP agree exactly with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting more rigorous thresholds (e.g. an e-value of 1e–10) would further accelerate the SAP process. SAP may also be coupled with novel similarity search tools other than BLAST (e.g. RAPsearch) to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and SAP provide a new time-efficient and convenient similarity search and annotation strategy for laboratories without access to high-performance computing facilities. SAP also offers high-performance computing facilities a way to process more similarity search tasks.
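The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the NCBI BLAST+ `blastx` program with tabular output (`-outfmt 6`), and the helper names (`blastx_command`, `hit_query_ids`, `select_candidates`) and file names are hypothetical. The MEGAN step is only indicated in a comment.

```python
from typing import Dict, Iterable, List, Set


def blastx_command(query: str, db: str, out: str, evalue: float = 1e-5) -> List[str]:
    """Build a BLAST+ blastx command line with tabular output (assumed tool)."""
    return ["blastx", "-query", query, "-db", db,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out]


def hit_query_ids(tabular_lines: Iterable[str]) -> Set[str]:
    """Collect query IDs (first tab-separated column) that have at least one hit."""
    ids: Set[str] = set()
    for line in tabular_lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def select_candidates(reads: Dict[str, str], hit_ids: Set[str]) -> Dict[str, str]:
    """Keep only the reads that hit the sub-database in the 1st BLAST."""
    return {read_id: seq for read_id, seq in reads.items() if read_id in hit_ids}


# Step 1: search all reads against the small sub-database (hypothetical file names).
step1 = blastx_command("reads.fasta", "sub_db", "step1.tsv")

# Toy stand-ins for the input reads and the 1st BLAST tabular output.
reads = {"read1": "ATGGCC", "read2": "TTTAAA", "read3": "GGGCCC"}
step1_output = ["read1\tsubject_a\t91.2", "read3\tsubject_b\t84.0"]
candidates = select_candidates(reads, hit_query_ids(step1_output))

# Step 2: search only the candidates against the whole NCBI-nr database.
step2 = blastx_command("candidates.fasta", "nr", "step2.tsv")

# Step 3: the step-2 output would be annotated with MEGAN, which discards
# candidates whose best NCBI-nr matches do not belong to the target group.
print(sorted(candidates))
```

The saving comes from step 2 searching only the small candidate set against NCBI-nr, while the full dataset is searched only against the much smaller sub-database.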

Highlights

  • High-throughput sequencing (HTS), such as 454 pyrosequencing and Illumina sequencing, has recently been applied as a promising new method to investigate genes or gene expression of microbial communities in different habitats, such as marine water [1], soil [2], human guts [3], oral cavities [4], and activated sludge [5,6]

  • The sub-database annotation pipeline (SAP) was verified by comparing the MEGAN annotation results of SAP-BLAST outputs with the MEGAN annotation results of direct BLAST outputs against the whole NCBI-nr database

  • BLAST was first conducted using two e-value cutoffs (1e–5 and 1e–10), and MEGAN was then applied with default parameters to annotate the BLAST output

Summary

Introduction

High-throughput sequencing (HTS), such as 454 pyrosequencing and Illumina sequencing, has recently been applied as a promising new method to investigate genes or gene expression of microbial communities in different habitats, such as marine water [1], soil [2], human guts [3], oral cavities [4], and activated sludge [5,6]. BLAST is the most commonly used similarity search tool and is designed to find distantly homologous sequences for taxonomic and functional assignment [8], but it requires tremendous computational capacity. It would take a month for a 1000-CPU computer cluster to conduct a full BLASTX search against the whole NCBI-nr database (amino acid sequences of ∼4 gigabytes (GB)) for a 20 giga base pair (Gbp) DNA dataset [8]. It took approximately 3 weeks to search a set of 100 Mbp of DNA against the NCBI-nr database using BLASTX on a workstation (Lenovo ThinkStation-D20: CPU 2.40 GHz × 16 threads; memory 96 GB). Analyzing huge HTS metagenomic datasets by BLASTX against NCBI-nr will therefore be a great challenge for laboratories without access to supercomputers.
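To put the two runtime figures above on a common scale, the following back-of-the-envelope calculation converts them to processor-hours per Gbp and applies the ∼150–385 times speedup reported in the abstract to the 3-week workstation run. It is only a rough sketch: it assumes a 30-day month, treats one thread as one processing unit, and assumes the reported speedup range would carry over to this particular workload, which the source does not state.

```python
HOURS_PER_DAY = 24


def unit_hours_per_gbp(units: int, days: float, gbp: float) -> float:
    """Processor-hours (CPUs or threads) spent per Gbp of query DNA."""
    return units * days * HOURS_PER_DAY / gbp


# 1000-CPU cluster, one month (assumed 30 days), 20 Gbp dataset.
cluster = unit_hours_per_gbp(units=1000, days=30, gbp=20)

# 16-thread workstation, about 3 weeks, 100 Mbp (0.1 Gbp) dataset.
workstation = unit_hours_per_gbp(units=16, days=21, gbp=0.1)

# Wall-clock time of the workstation run if a speedup of 150x or 385x applied.
direct_hours = 21 * HOURS_PER_DAY
sap_hours_slow = direct_hours / 150
sap_hours_fast = direct_hours / 385

print(f"cluster:     {cluster:.0f} CPU-hours per Gbp")
print(f"workstation: {workstation:.0f} thread-hours per Gbp")
print(f"SAP estimate: {sap_hours_fast:.1f} to {sap_hours_slow:.1f} hours")
```

Under these assumptions, a direct search that occupies the workstation for about three weeks would shrink to roughly 1.3 to 3.4 hours.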

