Abstract

Massive amounts of metagenomics data are currently being produced, and in all such projects a sizeable fraction of the resulting data shows little or no homology to known sequences. It is likely that this fraction contains novel viruses, but identification is challenging since they frequently lack homology to known viruses. To overcome this problem, we developed a strategy to detect ORFan protein families in shotgun metagenomics data, using similarity-based clustering and a set of filters to extract bona fide protein families. We applied this method to 17 virus-enriched libraries originating from human nasopharyngeal aspirates, serum, feces, and cerebrospinal fluid samples. This resulted in 32 predicted putative novel gene families. Some families showed detectable homology to sequences in metagenomics datasets and protein databases after reannotation. Notably, one predicted family matches an ORF from the highly variable Torque Teno virus (TTV). Furthermore, follow-up of a predicted ORFan resulted in the complete reconstruction of a novel circular genome. Its organisation suggests that it most likely corresponds to a novel bacteriophage in the Microviridae family; hence it was named bacteriophage HFM.

Highlights

  • Characterization of the human virome is crucial for our understanding of the role of the microbiome in health and disease

  • The detection of coding sequences with no homologs, or ORFans[10], in such datasets can be a first step towards the discovery of novel viral species, since novel protein sequences can be used as anchors for the characterization of entire viral genomes

  • Most current metagenomics gene finders rely on hidden Markov models, which model statistical differences between coding and non-coding nucleotide frequencies and other features to estimate the probability that an open reading frame encodes a protein[12,13,14,15,16]



Introduction

Characterization of the human virome is crucial for our understanding of the role of the microbiome in health and disease. The shift from culture-based methods to metagenomics in recent years, combined with the development of virus particle enrichment protocols, has made it possible to efficiently study the entire flora of human viruses and bacteriophages associated with the human microbiome. Novel protein-coding genes can be detected by the alignment of unknown related protein sequences, combined with the use of KA/KS ratios to detect sequences that are under selection pressure. The advantage of this method is that it searches for conserved signals directly within the dataset using a fixed statistical model, and does not depend on a previous training procedure. Ab initio KA/KS methods are best applied when diverse datasets are analysed together, since they leverage protein diversity to make accurate predictions. This strategy was used by the Global Oceanic Survey (GOS), which identified ~1700 putative novel ORFan protein families[17].
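To illustrate the KA/KS idea, the following minimal Python sketch estimates KA and KS for two aligned coding sequences using a simplified Nei-Gojobori counting scheme with a Jukes-Cantor correction. It is not the pipeline used in this study. All function names (`syn_sites`, `count_diffs`, `ka_ks`) are hypothetical, and the sketch assumes in-frame, gap-free sequences of equal length without internal stop codons. For site counting, mutations to stop codons are simply treated as nonsynonymous.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3)."""
    s = 0.0
    for i in range(3):
        for b in BASES:
            if b != codon[i]:
                mutant = codon[:i] + b + codon[i + 1:]
                # Assumption: a change to a stop codon counts as nonsynonymous.
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    s += 1 / 3
    return s


def count_diffs(c1, c2):
    """Average (synonymous, nonsynonymous) changes over all mutation orders."""
    pos = [i for i in range(3) if c1[i] != c2[i]]
    if not pos:
        return 0.0, 0.0
    syn = non = 0.0
    paths = 0
    for order in permutations(pos):
        cur, s, n, valid = c1, 0, 0, True
        for i in order:
            nxt = cur[:i] + c2[i] + cur[i + 1:]
            if CODON_TABLE[nxt] == "*":  # skip pathways through stop codons
                valid = False
                break
            if CODON_TABLE[nxt] == CODON_TABLE[cur]:
                s += 1
            else:
                n += 1
            cur = nxt
        if valid:
            syn, non, paths = syn + s, non + n, paths + 1
    if paths == 0:  # no valid pathway: count every change as nonsynonymous
        return 0.0, float(len(pos))
    return syn / paths, non / paths


def jukes_cantor(p):
    """Correct a proportion of differences for multiple substitutions."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor correction")
    return -0.75 * log(1 - 4 * p / 3)


def ka_ks(seq1, seq2):
    """Return (KA, KS) for two aligned, in-frame coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    codons = [(seq1[i:i + 3], seq2[i:i + 3]) for i in range(0, len(seq1), 3)]
    S = sum((syn_sites(a) + syn_sites(b)) / 2 for a, b in codons)
    N = 3 * len(codons) - S
    Sd = Nd = 0.0
    for a, b in codons:
        s, n = count_diffs(a, b)
        Sd, Nd = Sd + s, Nd + n
    return jukes_cantor(Nd / N), jukes_cantor(Sd / S)


if __name__ == "__main__":
    # Toy pair: one synonymous (GCT->GCC) and one nonsynonymous (GCT->GAT) change.
    a = "GCT" * 10
    b = "GCC" + "GCT" * 8 + "GAT"
    ka, ks = ka_ks(a, b)
    print(f"KA={ka:.3f} KS={ks:.3f} KA/KS={ka / ks:.3f}")
```

A KA/KS ratio well below 1 is the signature of purifying selection expected for a genuine protein-coding sequence. Very short or nearly identical alignments give unstable estimates (KS may be zero), which is one reason such methods benefit from diverse datasets.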
