The aim of this study was to evaluate the use of machine learning for rapid classification of arbovirus genomes. First, genomic sequences of 17 distinct arboviruses were collected from the National Center for Biotechnology Information database. Genomic sequences of arthropod-specific viruses were also collected to form a separate class representing a “non-arbovirus” group. The sequences were then transformed into canonical k-mer frequencies and used to train five supervised classification algorithms: multinomial logistic regression, decision tree, k-nearest neighbors, support vector machine and multilayer perceptron. Six k-mer sizes, from k = 1 to k = 6, were evaluated. Under 10-fold cross-validation, the model built with the multilayer perceptron and k = 6 achieved the best average accuracy (98.8%). To assess the generalization capacity of this best model, it was used to classify genomic sequences not present in the training database. These classifications were evaluated by accuracy, precision, recall and F1-score, which reached 98.5%, 98.3%, 98.2% and 98.2%, respectively. Finally, the best model was incorporated into a web application that accepts viral genomic sequences as input and classifies them. The application is freely available at https://arbovirusclassifiercanonicalkmer-8fndyh3tsxrftmr66jmpas.streamlit.app.
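The canonical k-mer transformation mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: the canonical form of a k-mer is taken to be the lexicographically smaller of the k-mer and its reverse complement (a common convention), sequences use the DNA alphabet A, C, G, T, and windows containing any other symbol are skipped. The function names are hypothetical and do not come from the study.

```python
from collections import Counter
from itertools import product

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Assumed convention: the lexicographically smaller of a k-mer
    and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmer_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Map every canonical k-mer to its relative frequency in `sequence`.

    Windows with symbols outside A, C, G, T (e.g. N) are skipped.
    The keys are fixed and sorted, so every genome yields a feature
    vector of the same length and order.
    """
    seq = sequence.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _BASES:
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {key: (counts[key] / total if total else 0.0) for key in keys}
```

With this convention, k = 6 yields 2080 canonical features instead of the 4096 raw 6-mers, because each k-mer is merged with its reverse complement. The resulting vectors could then be passed to any of the classifiers listed above.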