Abstract

Predicting the hosts of newly discovered viruses is important for pandemic surveillance of infectious diseases. We investigated alignment-based methods, alignment-free methods, and a support vector machine (SVM) trained on mononucleotide frequencies and dinucleotide biases for predicting the hosts of viruses, and applied these approaches to three datasets: rabies virus, coronavirus, and influenza A virus. For coronavirus, we used the spike gene sequences, while for rabies and influenza A viruses, we used the more conserved nucleoprotein gene sequences. We compared the three methods under different scenarios and showed that their performance correlates strongly with sequence variability and sample size. For conserved genes such as the nucleoprotein gene, k-mers longer than mono- and dinucleotides are needed to better distinguish the sequences. We also showed that both alignment-based and alignment-free methods can accurately predict the hosts of viruses. When alignment is difficult to achieve or highly time-consuming, alignment-free methods can be a promising substitute for predicting the hosts of new viruses.
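To make the mononucleotide frequency and dinucleotide bias features concrete, the sketch below computes both for a single nucleotide sequence. The abstract does not give the formula, so the code assumes the commonly used odds-ratio definition of dinucleotide bias (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies). The function names are illustrative and not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def mononucleotide_freq(seq):
    """Relative frequency of each base, ignoring ambiguous symbols such as N."""
    seq = seq.upper()
    counts = Counter(b for b in seq if b in BASES)
    total = sum(counts.values())
    return {b: (counts[b] / total if total else 0.0) for b in BASES}

def dinucleotide_bias(seq):
    """Assumed definition: rho_xy = f_xy / (f_x * f_y) over overlapping pairs."""
    seq = seq.upper()
    mono = mononucleotide_freq(seq)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in BASES and p[1] in BASES]
    counts = Counter(pairs)
    total = len(pairs)
    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = mono[x] * mono[y]
        observed = counts[x + y] / total if total else 0.0
        # A base that never occurs gives expected == 0; report 0.0 in that case.
        bias[x + y] = observed / expected if expected > 0 else 0.0
    return bias

def feature_vector(seq):
    """4 mononucleotide frequencies followed by 16 dinucleotide biases."""
    mono = mononucleotide_freq(seq)
    bias = dinucleotide_bias(seq)
    return [mono[b] for b in BASES] + [bias[x + y] for x, y in product(BASES, repeat=2)]
```

A 20-dimensional vector of this kind could then be passed to any standard classifier, such as an SVM; the kernel and parameters used in the study are not stated in this section.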

Highlights

  • Viruses are ubiquitous and can reproduce and evolve very fast

  • We initially calculated the prediction accuracies of the K-nearest neighbors (KNN) algorithm based on the alignment method and the alignment-free distance/dissimilarity measures for k-mer length from 3 to 6 and the number of neighbors K from 1 to 10

  • The results for the rabies virus, coronavirus, and influenza A virus datasets are given as Figs S1, S2 and S3 in the supplementary material, respectively
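The K-nearest neighbors procedure described in the highlights can be sketched as follows for the alignment-free case. This is a minimal illustration, not the authors' implementation: it assumes normalized k-mer frequency profiles and uses the Manhattan distance as a stand-in, since the specific distance/dissimilarity measures are not named in this section. The names `kmer_profile`, `manhattan`, and `knn_predict_host` are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k):
    """Normalized frequencies of all 4**k k-mers over A, C, G, T."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # k-mers with ambiguous bases are skipped
    return [(counts[m] / total if total else 0.0) for m in kmers]

def manhattan(p, q):
    """Stand-in dissimilarity (an assumption, not the measure used in the paper)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_predict_host(query, labeled, k=3, n_neighbors=1):
    """Predict the host of `query` by majority vote among its nearest neighbors.

    `labeled` is a list of (sequence, host) pairs with known hosts.
    """
    q = kmer_profile(query, k)
    ranked = sorted(labeled, key=lambda item: manhattan(q, kmer_profile(item[0], k)))
    votes = Counter(host for _, host in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]
```

Sweeping `k` from 3 to 6 and `n_neighbors` from 1 to 10, as described above, amounts to calling `knn_predict_host` inside two nested loops and recording the accuracy for each pair of settings.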

Summary

Introduction

Viruses are ubiquitous and can reproduce and evolve very fast. Viral infections in humans can cause various diseases and pose a major threat to human health. With the availability of various databases containing different types of pathogenic microbial species, one of the most commonly used approaches for identifying the origin of a new pathogen responsible for an emerging infectious disease (EID) is to find similar sequences in the pathogen databases using alignment by the Smith-Waterman algorithm[6], BLAST[7], or other alignment tools. Several alignment-free methods have been developed for the identification of the hosts of pathogenic species. Aguas and Ferguson[9] developed a feature selection method and used random forests (RF) based on the diverged nucleotide or amino acid bases among a set of aligned molecular sequences to predict the host species of pathogens. We investigate the effectiveness of alignment-based, alignment-free, and machine-learning-based methods for inferring the hosts of viruses responsible for emerging infectious diseases.
