Abstract
There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Recognizing this, this study has investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic Single-stranded RNA viruses of Baltimore group four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The dataset containing viral gene sequences was analyzed using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and the Fully Connected Deep Neural Network, a Deep Learning algorithm, to predict viral families and hosts. Detailed analyses were performed on the classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The observed results show that Fully Connected Deep Neural Network has a high success rate of 99.60 % in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53 %. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of Single-stranded RNA viruses. Our detailed investigations on how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have