Real Time Classification of Viruses in 12 Dimensions

Chenglong Yu, Jie Yang, Troy Hernandez, Hsin-Hsiung Huang, Shek-Chung Yau, Rong Lucy He, Hui Zheng, Stephen S.-T. Yau

doi:10.1371/journal.pone.0064328


Open Access


https://doi.org/10.1371/journal.pone.0064328


Abstract


The International Committee on Taxonomy of Viruses authorizes and organizes the taxonomic classification of viruses. Thus far, the detailed classifications for all viruses are neither complete nor free from dispute. For example, the current missing label rates in GenBank are 12.1% for family label and 30.0% for genus label. Using the proposed Natural Vector representation, all 2,044 single-segment referenced viral genomes in GenBank can be embedded in R^12. Unlike other approaches, this allows us to determine phylogenetic relations for all viruses at any level (e.g., Baltimore class, family, subfamily, genus, and species) in real time. Additionally, the proposed graphical representation for virus phylogeny provides a visualization of the distribution of viruses in R^12. Unlike the commonly used tree visualization methods, which suffer from uniqueness and existence problems, our representation always exists and is unique. This approach is successfully used to predict and correct viral classification information, as well as to identify viral origins; for example, a recent public health threat, the West Nile virus, is closer to the Japanese encephalitis antigenic complex based on our visualization. Based on cross-validation results, the accuracy rates of our predictions are as high as 98.2% for Baltimore class labels, 96.6% for family labels, 99.7% for subfamily labels and 97.2% for genus labels.

Highlights

The rapid development of sequencing technologies produces a large number of viral genome sequences
After checking the consistency between Baltimore classification and International Committee on Taxonomy of Viruses (ICTV) families, we find that the original GenBank records of the viruses in the Retroviridae family (RNA viruses) contain erroneous DNA label information
Predict Baltimore class label: for each virus, we find its nearest neighbor in the 12-dimensional genome space
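The nearest-neighbor prediction mentioned in the last highlight can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the leave-one-out scheme is an assumption (the text only says "cross-validation"), and the toy points are short made-up vectors rather than real 12-component viral vectors.

```python
import math


def nearest_neighbor_label(query, points, labels, skip=None):
    """Return the label of the point closest to `query` in Euclidean distance.

    `skip` is an optional index to leave out (used for leave-one-out checks).
    """
    best_label, best_dist = None, math.inf
    for i, (point, label) in enumerate(zip(points, labels)):
        if i == skip:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point)))
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label


def leave_one_out_accuracy(points, labels):
    """Fraction of points whose nearest other point carries the same label."""
    hits = sum(
        nearest_neighbor_label(p, points, labels, skip=i) == labels[i]
        for i, p in enumerate(points)
    )
    return hits / len(points)


if __name__ == "__main__":
    # Toy data only: two well-separated groups with placeholder labels.
    toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    toy_labels = ["class_a", "class_a", "class_b", "class_b"]
    print(nearest_neighbor_label((4.0, 4.0), toy_points, toy_labels))
    print(leave_one_out_accuracy(toy_points, toy_labels))
```

The same functions work unchanged on 12-component vectors, since the distance is computed over however many coordinates the points have.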

Summary

Introduction

The rapid development of sequencing technologies produces a large number of viral genome sequences. Unlike k-mer methods, which ignore the positional information of nucleotides, the natural vector characterization constructs a one-to-one correspondence between genome sequences and numerical vectors [10]. Along this line, we construct a viral genome space in R^12 based on the quantity and global distribution of nucleotides in viral sequences. The Euclidean distance between two points represents the biological distance between the two corresponding viruses. This allows us to compare a virus simultaneously against all available viruses at any level (e.g., Baltimore class, family, subfamily, genus, and species) in a fast and efficient manner. We propose a two-dimensional graphical representation of viruses in the genome space that is unique and does not depend on any model assumption.
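As a rough illustration of how a sequence might be mapped to a point in the 12-dimensional space, the sketch below computes, for each of A, C, G and T, the count, the mean position, and a normalized second central moment of the positions. This summary does not spell out the 12 components, so that choice (and the normalization by n_k * n) is an assumption based on the usual natural vector definition; the function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return 12 numbers: counts, mean positions, and normalized second
    central moments of positions for A, C, G, T (assumed definition)."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for k in NUCLEOTIDES:
        # 1-based positions at which nucleotide k occurs
        positions = [i + 1 for i, base in enumerate(seq) if base == k]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: all three components are set to zero
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    v1 = natural_vector("ACGT")
    v2 = natural_vector("TGCA")  # same composition, different order
    print(len(v1), euclidean_distance(v1, v2))
```

The two toy sequences have identical nucleotide counts but different mean positions, so their distance is nonzero, which reflects the positional information that composition-only comparisons discard.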

Methods

Results

Conclusion