Abstract

The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
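As a rough illustration of two of the quantities named in the abstract, the sketch below counts k-mers in a sequence and computes the Shannon diversity index of the resulting counts, along with the number of distinct k-mers shared by two genomes. It is a minimal sketch only: the function names, the toy sequences, the use of the natural logarithm, and the choice to skip k-mers containing non-ACGT characters are all assumptions made for illustration, not details taken from the study. Cumulative relative entropy is not shown because its definition is not given here.

```python
from collections import Counter
from math import log

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def shannon_diversity(counts):
    """Shannon diversity index H = -sum(p * ln p) over observed k-mer frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def common_features(counts_a, counts_b):
    """Number of distinct k-mers present in both genomes."""
    return len(counts_a.keys() & counts_b.keys())

# Toy sequences, made up for illustration (not real viral genomes).
genome_a = "ACGTACGTGACCA"
genome_b = "TTACGTGACGGTA"
for k in (2, 3, 4):
    a, b = kmer_counts(genome_a, k), kmer_counts(genome_b, k)
    print(k, round(shannon_diversity(a), 3), common_features(a, b))
```

Scanning these values across a range of k is one way to see how diversity and shared-feature counts change with feature length; how the study combines them into a final choice of k is not specified in this summary.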

Highlights

  • Phylogenomic dendrograms constructed using whole-genome sequences are based on a more complete set of genomic information than phylogenies based on individual genes[22]

  • In previous studies of dsDNA eukaryotic viruses[15,16,21], the optimal feature length was based on cumulative relative entropy (CRE) and relative sequence divergence (RSD)

  • We found that RSD does not decrease monotonically as k increases, probably because the very high-dimensional k-mer space can include artificial k-mers (k-mers derived from random sequences), even though their probabilities are quite low

Summary

Introduction

Phylogenomic dendrograms constructed using whole-genome sequences are based on a more complete set of genomic information than phylogenies based on individual genes[22]. The primary advantage of alignment-free methods is that they enable quick genome-scale comparisons with linear time complexity (O(n))[28], which is more efficient than maximum likelihood or Bayesian alignment methods with quadratic time complexity (O(n²)). Another advantage of alignment-free methods is that they can be used to compare sequences from draft genomes, with information loss proportional to the number of discontinuities in a genome. Determining relative sequence divergence (RSD) values becomes increasingly computationally complex as the number of genomes grows. This increase in complexity is due, in part, to an increase in the density of the k-mer feature space. We found that RSD does not decrease monotonically as k increases, probably because the very high-dimensional k-mer space can include artificial k-mers (k-mers derived from random sequences), even though their probabilities are quite low.
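To make the linear-time comparison concrete, the sketch below builds a k-mer frequency profile for each sequence in a single pass and then compares two profiles. The Jensen-Shannon divergence is used here only as one common choice of profile distance; this summary does not state which distance the study uses, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def kmer_profile(seq, k):
    """Relative k-mer frequencies from one left-to-right pass over the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2, between 0 and 1) of two sparse profiles."""
    jsd = 0.0
    for kmer in p.keys() | q.keys():
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture frequency of this k-mer
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd
```

Because only the k-mers actually observed are stored, the cost of a comparison grows with sequence length rather than with the 4^k size of the full feature space, and no alignment of the two sequences is needed.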
