Abstract

Giant viruses are widespread in the biosphere and play important roles in biogeochemical cycling and host genome evolution. Also known as nucleo-cytoplasmic large DNA viruses (NCLDVs), these eukaryotic viruses harbor the largest and most complex viral genomes known. Studies have shown that NCLDVs are frequently abundant in metagenomic datasets, and that sequences derived from these viruses can also be found endogenized in diverse eukaryotic genomes. The accurate detection of sequences derived from NCLDVs is therefore of great importance, but this task is challenging owing to both the high level of sequence divergence between NCLDV families and the extraordinarily high diversity of genes encoded in their genomes, including some that encode metabolic or translation-related functions typically found only in cellular lineages. Here, we present ViralRecall, a bioinformatic tool for the identification of NCLDV signatures in ‘omic data. This tool leverages a library of giant virus orthologous groups (GVOGs) to identify sequences that bear signatures of NCLDVs. We demonstrate that this tool can identify NCLDV sequences with high sensitivity and specificity. Moreover, we show that it is useful both for removing contaminating sequences from metagenome-assembled viral genomes and for identifying eukaryotic genomic loci that are derived from NCLDVs. ViralRecall is written in Python 3.5 and is freely available on GitHub: https://github.com/faylward/viralrecall.

Highlights

  • Nucleo-cytoplasmic large DNA viruses (NCLDVs) are a diverse group of viruses that comprise several divergent lineages; the average amino acid identity between NCLDVs from different families can be as low as ~20%, and in some cases it can be difficult to identify any signatures of homology between disparate NCLDV genomes [10]

  • We present several benchmarking criteria which establish that ViralRecall v. 2.0 has high specificity and sensitivity and can be used for several applications, including the removal of non-NCLDV contamination from metagenome-derived viral genomes and the identification of endogenous NCLDVs in eukaryotic genomes

  • We found that some giant virus orthologous groups (GVOGs) are commonly found in Caudovirales as well as NCLDVs, and we normalized the HMMER score of these GVOGs to avoid false-positive detection
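
The last highlight describes normalizing the HMMER scores of GVOGs that also occur in Caudovirales. The text here does not give the normalization formula, so the following is only a minimal sketch of one way such a correction could work: each hit's bit score is multiplied by a per-GVOG weight, and GVOGs that are not shared with phages keep their original score. The GVOG identifiers, the weights, and the function name `normalize_hits` are all hypothetical and are not taken from ViralRecall.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical GVOG identifiers and weights, for illustration only.
# A lower weight means the GVOG is assumed to be more common in Caudovirales.
PHAGE_SHARED_WEIGHTS: Dict[str, float] = {"GVOG_A": 0.25, "GVOG_B": 0.5}


def normalize_hits(
    hits: Iterable[Tuple[str, str, float]],
    shared_weights: Dict[str, float],
) -> List[Tuple[str, str, float]]:
    """Scale the bit score of each (protein, gvog, score) hit.

    GVOGs absent from shared_weights keep their score (weight 1.0).
    This is an assumed scheme, not the published ViralRecall method.
    """
    return [
        (protein, gvog, score * shared_weights.get(gvog, 1.0))
        for protein, gvog, score in hits
    ]


# Two made-up hits: one to a phage-shared GVOG, one to an unshared GVOG.
hits = [("orf_1", "GVOG_A", 120.0), ("orf_2", "GVOG_C", 80.0)]
normalized = normalize_hits(hits, PHAGE_SHARED_WEIGHTS)
```

Under this assumed scheme, a phage contig that matches only phage-shared GVOGs accumulates a much lower total score than an NCLDV contig, which is the kind of false-positive suppression the highlight refers to.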

Introduction

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

NCLDVs are a diverse group of viruses that comprise several divergent lineages; the average amino acid identity between NCLDVs from different families can be as low as ~20%, and in some cases it can be difficult to identify any signatures of homology between disparate NCLDV genomes [10]. Tools such as MG-Digger [25], Giant Virus ... The VOG database contains orthologous groups from numerous viruses; this tool could not discriminate between different viral groups, and several additional analyses were needed to trace the specific viral provenance of the sequences identified. We present several benchmarking criteria which establish that ViralRecall v. 2.0 has high specificity and sensitivity and can be used for several applications, including the removal of non-NCLDV contamination from metagenome-derived viral genomes and the identification of endogenous NCLDVs in eukaryotic genomes.

NCLDV Genomes Used for Database Construction
Calculation of ViralRecall Scores
Benchmarking
Results and Discussion