Abstract

In the past few years, the field of metagenomics has grown at an accelerated pace, driven particularly by advances in sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a hierarchy of priority rules. Additional annotation is provided in the form of Enzyme Commission (EC) numbers, GO/MeGO terms and Hidden Markov Models, together with supporting evidence.
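To make the idea of ranking analysis results by priority rules concrete, the following is a minimal sketch in Python. It is not the VMGAP implementation: the evidence categories, their order, the `Evidence` class and the `assign_function` helper are all hypothetical names chosen for illustration, and the actual rule hierarchy is defined in the full text.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical evidence categories, most trusted first.
# Both the names and the order are illustrative assumptions,
# not the rules used by VMGAP.
PRIORITY = [
    "curated_hmm",
    "blast_high_identity",
    "domain_hit",
    "blast_low_identity",
]


@dataclass
class Evidence:
    kind: str          # one of the categories in PRIORITY
    description: str   # functional term proposed by this analysis
    score: float       # analysis-specific score, higher is better


def assign_function(evidence: List[Evidence]) -> str:
    """Return the functional term backed by the highest-priority evidence.

    Ties within one category are broken by the higher score. Proteins
    with no usable evidence receive a generic placeholder name.
    """
    rank = {kind: i for i, kind in enumerate(PRIORITY)}
    usable = [e for e in evidence if e.kind in rank]
    if not usable:
        return "hypothetical protein"
    best = min(usable, key=lambda e: (rank[e.kind], -e.score))
    return best.description


if __name__ == "__main__":
    hits = [
        Evidence("blast_low_identity", "phage protein", 200.0),
        Evidence("domain_hit", "terminase large subunit", 50.0),
    ]
    print(assign_function(hits))
```

In this sketch the category of the evidence always outranks its raw score, so a weaker hit of a more trusted type wins over a stronger hit of a less trusted type. That is one simple way to read "ranked according to priority rules"; other schemes are possible.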

Highlights

  • Viruses are the most abundant biological agents and comprise the majority of the biodiversity on Earth [1,2,3]

  • This has triggered exponential growth in the amount of metagenomic sequencing data available in public repositories and underscores the need for specialized, highly efficient computational tools to cope with the functional annotation of these massive datasets

  • To quantitatively assess the utility of the Viral MetaGenome Annotation Pipeline (VMGAP) for the functional annotation of viral metagenomic data, we ran an identical set of ~300,000 peptide sequences from a marine viral metagenomic library, or their respective coding open reading frames (ORFs), through VMGAP and MG-RAST, respectively

Summary

Introduction

Viruses are the most abundant biological agents and comprise the majority of the biodiversity on Earth [1,2,3]. Metagenomic data originate from heterogeneous microbial communities, are usually noisy and partial, and reads frequently contain truncated open reading frames (ORFs). Complicating this landscape, the vast majority of viruses isolated from environmental samples are novel, and most of their genes have no homologous sequences in the public databases, making functional annotation even more difficult. While MG-RAST has been used for the functional annotation of multiple viral metagenomes [12], it is not ideal for the characterization of viral metagenomic data, since its functional classification depends solely on similarity to FIGfams [13], protein families developed from manually curated bacterial and archaeal proteins. Another limitation of this tool is that it does not search for conserved protein domains or motifs that could provide additional clues about the functional roles of genes present in metagenomic samples. TIGRFAM [26] HMMDBs, ACLAME protein and HMMDBs [27], GenBank CDDDB [28] and pfam2gomappingsDB [11]

Procedure
21 Number of predicted transmembrane domains
Findings
Discussion
