Abstract

Background: RNA viruses, including severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), are important human pathogens. Sequencing of the proteins produced by RNA viruses is essential for understanding disease pathogenesis and may have diagnostic and therapeutic implications. We aimed to develop an accurate and computationally efficient handcrafted feature engineering model for classifying the protein sequences of six pathogenic RNA viruses: SARS-CoV-2, influenza A, influenza B, influenza C, human respirovirus 3, and human immunodeficiency virus (HIV)-1. The first five cause primary respiratory infections; the last has some functional similarity with SARS-CoV-2, which justifies the need for diagnostic differentiation.

Materials and method: We downloaded 14,787 protein sequences belonging to the six categories in FASTA format from the open-source National Center for Biotechnology Information database and transformed the sequences into numeric arrays. First, each numeric sequence (signal) was divided into overlapping blocks, each representing three amino acids. Tiny textural motif pattern, a new histogram-based feature extractor, was then applied to extract textural features using simple signum, lower ternary, and upper ternary functions. For each protein sequence, 512 features were extracted and fed to an iterative neighborhood component analysis function, which selected the 34 most discriminative features, an optimal number specific to the study dataset. The selected features were then classified using a shallow k-nearest neighbor classifier with 10-fold cross-validation.

Novelties: The model classifies data with linear time complexity, providing a robust approach, especially for complex datasets. It also extends beyond the traditional focus on binary classification, successfully distinguishing up to six distinct classes. Furthermore, we developed a novel handcrafted feature extraction method that enhances data analysis and yields more precise results.

Results: The model attained 99.71% overall 6-class classification accuracy in a data subset and 99.85% accuracy for binary classification of SARS-CoV-2 vs. HIV-1, outperforming a similar published model.

Conclusions: Our simple model accurately classified the protein sequences of six pathogenic RNA viruses and can potentially be implemented in diagnostic applications to improve RNA virus disease screening.
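The feature extraction idea described under Materials and method (numeric encoding, overlapping blocks of three amino acids, and signum-style lower and upper ternary functions feeding a histogram) can be illustrated with a minimal Python sketch. This is not the authors' tiny textural motif pattern: the abstract does not define the extractor, so the integer encoding, the three pairwise comparisons per block, the `threshold` parameter, and the resulting 16-bin output are all assumptions made for illustration (the published method yields 512 features). Feature selection and k-nearest neighbor classification are omitted.

```python
# Illustrative only: a generic ternary-pattern histogram over overlapping
# blocks of three residues. All names and parameters here are hypothetical.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed encoding: 1-based position in AMINO_ACIDS; unknown symbols map to 0.
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Turn a protein sequence string into a list of integers."""
    return [CODE.get(ch, 0) for ch in sequence.upper()]


def ternary(a, b, threshold):
    """Return +1, 0, or -1 depending on how a - b compares with the threshold."""
    diff = a - b
    if diff > threshold:
        return 1
    if diff < -threshold:
        return -1
    return 0


def ternary_histograms(sequence, threshold=1):
    """Build upper and lower pattern histograms (8 bins each, 16 features)."""
    values = encode(sequence)
    pairs = ((0, 1), (1, 2), (0, 2))  # assumed comparisons within a block
    upper = [0] * 8
    lower = [0] * 8
    # Overlapping blocks of three consecutive residues, stride 1.
    for i in range(len(values) - 2):
        block = values[i:i + 3]
        up_code = 0
        low_code = 0
        for bit, (p, q) in enumerate(pairs):
            t = ternary(block[p], block[q], threshold)
            if t == 1:
                up_code |= 1 << bit   # upper pattern keeps the +1 results
            elif t == -1:
                low_code |= 1 << bit  # lower pattern keeps the -1 results
        upper[up_code] += 1
        lower[low_code] += 1
    return upper + lower


if __name__ == "__main__":
    print(ternary_histograms("MFVFLVLLPLVSSQ"))
```

Each sequence is visited once with constant work per block, so the sketch runs in time linear in the sequence length, which is consistent with the linear time complexity the abstract claims for the full model.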

