Abstract
Metagenomes can be considered mixtures of viral, bacterial, and eukaryotic DNA sequences. Mining viral sequences from metagenomes could shed light on virus–host relationships and expand viral databases. Current alignment-based methods are poorly suited to identifying viral sequences in metagenomic data because most assembled metagenomic contigs are short and contain few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences from metagenomic data. VirMC uses Markov chains to model sequence signatures and constructs a scoring model based on a likelihood test to distinguish viral from bacterial sequences. Compared with two state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, VirMC outperformed VirFinder and performed similarly to PPR-Meta on short contigs (length less than 400 bp). VirMC outperformed both VirFinder and PPR-Meta in identifying viral sequences in metagenomic samples contaminated with eukaryotic sequences. VirMC also showed better performance in assembling viral-genome sequences from metagenomic data (based on filtering potential bacterial reads). Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help classify healthy and diseased statuses. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
Highlights
Viruses are obligate intracellular parasites that probably infect all cellular forms of life (Breitbart and Rohwer, 2005)
To generate validation and testing datasets containing 10, 50, or 90% viral contigs, the number of viral contigs was set as shown in Table 1, and the viral contigs were combined with nine times as many, an equal number of, or one-ninth as many randomly sampled bacterial contigs discovered after 1 June 2015, respectively
The guanine-cytosine (GC) frequency of each bacterial genomic sequence was calculated. These bacterial genomic sequences were grouped into different bins using the quantiles of the GC-frequency distribution, and a Markov model was constructed for each bin
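The GC-based grouping described above can be sketched as follows. This is an illustrative sketch rather than the author's code: the number of bins (four), the use of `statistics.quantiles` with the inclusive method, the toy sequences, and the helper names are all assumptions. Each resulting group would then be used to train its own Markov model, and a query contig would be scored against the model of the bin that matches its GC frequency.

```python
import bisect
import statistics
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def quantile_edges(values, n_bins=4):
    """Interior cut points (n_bins - 1 of them) of the GC distribution."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")

def bin_index(gc, edges):
    """Index of the bin a GC value falls into, given sorted cut points."""
    return bisect.bisect_right(edges, gc)

def group_by_gc(genomes, n_bins=4):
    """Group sequences into quantile bins of their GC frequency."""
    gcs = [gc_fraction(g) for g in genomes]
    edges = quantile_edges(gcs, n_bins)
    bins = defaultdict(list)
    for genome, gc in zip(genomes, gcs):
        bins[bin_index(gc, edges)].append(genome)
    return edges, dict(bins)

# Toy "genomes" (hypothetical) spanning a range of GC frequencies.
genomes = [
    "A" * 9 + "G" * 1, "A" * 8 + "G" * 2, "A" * 6 + "G" * 4, "A" * 5 + "C" * 5,
    "A" * 4 + "C" * 6, "A" * 3 + "G" * 7, "A" * 2 + "C" * 8, "A" * 1 + "G" * 9,
]
edges, bins = group_by_gc(genomes)

# Pick the bin (and hence the bacterial model) for a query contig.
query_bin = bin_index(gc_fraction("ACGTACGTAC"), edges)
print(edges, {b: len(seqs) for b, seqs in bins.items()}, query_bin)
```

Quantile cut points keep the bins roughly equal in size, which avoids training a model on very few genomes at the extremes of the GC distribution.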
Summary
Viruses are obligate intracellular parasites that probably infect all cellular forms of life (Breitbart and Rohwer, 2005). At least 10^31 virus particles exist globally at any given time, and in most environments the number of detectable virus particles exceeds the number of bacterial cells by 10-fold (Edwards and Rohwer, 2005; Rosario and Breitbart, 2011; Mokili et al., 2012; Chow and Suttle, 2015). Bacterial viruses represent the most numerous viral entities, and they affect host bacteria (Breitbart and Rohwer, 2005). Many metagenomic studies rely on selectively capturing and sequencing viral particles outside prokaryotic host cells, although sequencing cellular fraction samples can also reveal viral sequences. Existing virome metagenomic studies cannot capture sequences from viruses replicating in prokaryotic host cells.