Abstract

Viruses and their host genomes often share similar oligonucleotide frequency (ONF) patterns, which can be used to predict the host of a given virus by finding the host with the greatest ONF similarity. We comprehensively compared 11 ONF metrics using several k-mer lengths for predicting host taxonomy from among ∼32 000 prokaryotic genomes for 1427 virus isolate genomes whose true hosts are known. The background-subtracting measure d2∗ at k = 6 gave the highest host prediction accuracy (33%, genus level) with reasonable computational times. Requiring a maximum dissimilarity score for making predictions (thresholding) and taking the consensus of the 30 most similar hosts further improved accuracy. Using a previous dataset of 820 bacteriophage and 2699 bacterial genomes, d2∗ host prediction accuracies with thresholding and consensus methods (genus level: 64%) exceeded those of previous Euclidean distance ONF (32%) or homology-based (22–62%) methods. When applied to metagenomically assembled marine SUP05 viruses and the human gut virus crAssphage, d2∗-based predictions overlapped (i.e. some same, some different) with the previously inferred hosts of these viruses. The extent of overlap improved when only using host genomes or metagenomic contigs from the same habitat or samples as the query viruses. The d2∗ ONF method will greatly improve the characterization of novel, metagenomic viruses.
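
The thresholding and consensus steps mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes that a dissimilarity score between the query virus and every candidate host has already been computed. The function name `predict_host_genus`, its parameters, the rule that the threshold is applied to the best (lowest) score, and the use of a simple majority vote are assumptions made for this illustration, not the authors' implementation.

```python
from collections import Counter


def predict_host_genus(scores, genus_of, max_dissimilarity=None, top_n=30):
    """Hypothetical helper: predict a host genus for one query virus.

    scores   -- dict mapping host id -> dissimilarity to the virus (lower = more similar)
    genus_of -- dict mapping host id -> genus name
    """
    # Rank candidate hosts from most to least similar.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if not ranked:
        return None
    # Thresholding (assumed rule): make no prediction if even the best
    # host is more dissimilar than the allowed maximum.
    if max_dissimilarity is not None and ranked[0][1] > max_dissimilarity:
        return None
    # Consensus (assumed rule): majority vote over the genera of the
    # top_n most similar hosts.
    votes = Counter(genus_of[host] for host, _ in ranked[:top_n])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    toy_scores = {"h1": 0.20, "h2": 0.25, "h3": 0.30}
    toy_genus = {"h1": "GenusA", "h2": "GenusB", "h3": "GenusB"}
    print(predict_host_genus(toy_scores, toy_genus, top_n=3))
```

With `top_n=1` the sketch reduces to picking the single most similar host; a stricter `max_dissimilarity` trades fewer predictions for, presumably, more reliable ones.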

Highlights

  • It is widely recognized that the ‘uncultured majority’ of bacteria and archaea dominate biomass in many ecosystems, control important global biogeochemical cycles, and significantly impact the health of humans, animals, and crops [1].

  • We tested the utility of 11 different oligonucleotide frequency (ONF) distance/dissimilarity measures, which belong to two major classes: those that use observed ONFs, namely Euclidean distance (Eu), Manhattan distance (Ma), Chebyshev distance (Ch), d2, and Jensen-Shannon divergence (JS); and those that take into account background k-mer frequencies of the host and virus, namely d2∗, d2S, Hao, Teeling, EuF and Willner.

  • Two major components impact the performance of these approaches: the type of measure and the k-mer length used.
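
To make the two classes of measures concrete, the Python sketch below computes k-mer frequencies for two sequences, three observed-frequency distances (Euclidean, Manhattan, Chebyshev), and a d2∗-style dissimilarity that subtracts expected counts before comparing. All function names are hypothetical, and the background here is a zero-order (i.i.d.) model estimated from each sequence's own base composition, which is an assumption made for brevity; the study may estimate background frequencies differently.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count every ACGT k-mer in seq; windows with other symbols are skipped."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return counts


def frequencies(counts):
    """Normalize counts to relative frequencies (all zeros if there are no counts)."""
    total = sum(counts.values())
    return {w: (c / total if total else 0.0) for w, c in counts.items()}


# Class 1: measures on observed frequencies only.
def euclidean(f, g):
    return sqrt(sum((f[w] - g[w]) ** 2 for w in f))


def manhattan(f, g):
    return sum(abs(f[w] - g[w]) for w in f)


def chebyshev(f, g):
    return max(abs(f[w] - g[w]) for w in f)


def expected_counts(seq, k):
    """Expected k-mer counts under an assumed zero-order (i.i.d.) background."""
    seq = seq.upper()
    base = {b: seq.count(b) for b in ALPHABET}
    n = sum(base.values())
    windows = sum(kmer_counts(seq, k).values())
    expected = {}
    for p in product(ALPHABET, repeat=k):
        prob = 1.0
        for b in p:
            prob *= (base[b] / n) if n else 0.0
        expected["".join(p)] = windows * prob
    return expected


# Class 2: a background-subtracting, d2*-style dissimilarity in [0, 1].
def d2star(seq_x, seq_y, k):
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    ex, ey = expected_counts(seq_x, k), expected_counts(seq_y, k)
    num = sx = sy = 0.0
    for w in cx:
        denom = sqrt(ex[w] * ey[w])
        if denom == 0:
            continue  # word impossible under one background; skip it
        xt, yt = cx[w] - ex[w], cy[w] - ey[w]  # centered counts
        num += xt * yt / denom
        sx += xt * xt / denom
        sy += yt * yt / denom
    if sx == 0 or sy == 0:
        return 0.5  # undefined case; treated as uncorrelated (assumption)
    return 0.5 * (1 - num / (sqrt(sx) * sqrt(sy)))


if __name__ == "__main__":
    virus, host = "ACGTTGCAACGGTA", "ACGTAGCAACGTTA"
    fv, fh = frequencies(kmer_counts(virus, 2)), frequencies(kmer_counts(host, 2))
    print(euclidean(fv, fh), manhattan(fv, fh), chebyshev(fv, fh))
    print(d2star(virus, host, 2))
```

The parameter `k` is the second component named above: larger k gives 4^k possible words, so short sequences yield sparse counts, which is one reason the choice of k-mer length interacts with the choice of measure.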


Introduction

It is widely recognized that the ‘uncultured majority’ of bacteria and archaea (prokaryotes) dominate biomass in many ecosystems, control important global biogeochemical cycles, and significantly impact the health of humans, animals, and crops [1]. Viruses generally outnumber prokaryotes and are estimated to be the most abundant biological entities on the planet [2,3]. They are important in limiting the abundance of their hosts and can significantly impact the processes and ecosystem functions that prokaryotes carry out [4,5,6]. Viruses are also important mediators of host evolution: by mediating horizontal gene transfer, they act as key mediators of host genomic innovation [7].
