Abstract

Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions.
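The completeness idea above can be illustrated with a minimal sketch. It assumes that an expected genome length has already been obtained for a contig, for example from a matched complete reference genome; how CheckV actually derives that value is not described here, so `expected_len` is simply passed in. The function names are hypothetical and are not part of CheckV. Only the >90% high-quality threshold comes from the text.

```python
def estimate_completeness(contig_len: int, expected_len: int) -> float:
    """Completeness (%) as contig length over expected genome length, capped at 100.

    expected_len is an assumption of this sketch: the length of a matched
    complete reference genome, supplied by the caller.
    """
    if contig_len < 0 or expected_len <= 0:
        raise ValueError("contig_len must be >= 0 and expected_len must be > 0")
    return min(100.0, 100.0 * contig_len / expected_len)


def is_high_quality(completeness: float) -> bool:
    """High quality is defined in the text as >90% complete."""
    return completeness > 90.0


if __name__ == "__main__":
    pct = estimate_completeness(48_000, 50_000)
    print(f"{pct:.1f}% complete, high quality: {is_high_quality(pct)}")
```

Capping at 100% is a choice made for this sketch so that contigs longer than their reference are not reported as more than complete.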

Highlights

  • Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably

  • CheckV is organized into three modules, which identify and remove host contamination on proviruses (Fig. 1a), estimate completeness for genome fragments (Fig. 1b) and predict closed genomes based on terminal repeats and flanking host regions (Fig. 1c)

  • We found that 90.0% of contigs with direct terminal repeats (DTRs) and an estimated completeness met the high-quality standard, compared with only 46.4% of complete proviruses and 33.2% of contigs with inverted terminal repeats (ITRs) (Extended Data Fig. 6)
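The two repeat types named in the highlights can be made concrete with a short sketch. A direct terminal repeat is the same sequence at both ends of a contig, while an inverted terminal repeat is a start sequence whose reverse complement appears at the end. The code below is an illustrative exact-match search under those definitions, not CheckV's implementation; the function names and the `min_len` default of 20 bp are assumptions made for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def direct_terminal_repeat(seq: str, min_len: int = 20) -> str:
    """Longest prefix (>= min_len) that is repeated exactly at the end, or ''."""
    seq = seq.upper()
    # Limit to half the contig so the two repeat copies cannot overlap.
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:k]
    return ""


def inverted_terminal_repeat(seq: str, min_len: int = 20) -> str:
    """Longest prefix (>= min_len) whose reverse complement ends the contig, or ''."""
    seq = seq.upper()
    rc = revcomp(seq)
    # The first k bases of rc are the reverse complement of the last k bases of seq.
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == rc[:k]:
            return seq[:k]
    return ""


if __name__ == "__main__":
    repeat = "ACGTTGCAAGGCTTAACCGG"  # made-up 20 bp repeat
    contig = repeat + "T" * 30 + repeat
    print("DTR:", direct_terminal_repeat(contig))
    print("ITR:", inverted_terminal_repeat(contig) or "none")
```

Exact matching keeps the sketch short; real assemblies may need tolerance for mismatches, which this example does not attempt.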

Summary

Introduction

Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. VIBRANT[11] and viralComplete[22] are two recently published tools that address these problems: VIBRANT categorizes sequences into quality tiers based on circularity and the presence of viral hallmark proteins, as well as nucleotide replication proteins, while viralComplete estimates completeness based on affiliation to known viruses from NCBI RefSeq. With regard to host contamination on proviruses, existing approaches either remove viral contigs containing a high fraction of microbial genes[5] or predict host–virus boundaries[10,11,23,24]. With the diversity of available viral prediction pipelines and protocols, there is a need for a standalone tool to ensure that viral contigs do not contain contamination, and to remove it when present.
