Abstract

The aim of the study is to analyze viruses using parameters obtained from distributions of nucleotide sequences in the viral RNA. To keep the input data homogeneous, we analyze only single-stranded RNA viruses. Two approaches are used to obtain the nucleotide sequences. In the first, chunks of equal length (four nucleotides) are considered. In the second, the whole RNA genome is divided into parts by treating adenine or the most frequent nucleotide as a “space”. Rank–frequency distributions are studied in both cases. Judging by the nature of their rank–frequency distributions, the nucleotide sequences defined in this way are signs comparable, to a certain extent, to syllables or words. Within the first approach, the Pólya and the negative hypergeometric distributions yield the best fit. For the distributions obtained within the second approach, we have calculated a set of parameters, including entropy, mean sequence length, and its dispersion. These parameters serve as the basis for the classification of viruses. We observed that proximity of viruses on planes spanned by various pairs of parameters corresponds to related species. In certain cases, such proximity is also observed for unrelated species, which calls for expanding the set of parameters used in the classification. We also observed that the fifth most frequent nucleotide sequences obtained within the second approach differ in nature among human coronaviruses (different nucleotides for MERS, SARS-CoV, and SARS-CoV-2 versus identical nucleotides for four other coronaviruses). We expect that our findings will be useful as a supplementary tool in the classification of diseases caused by RNA viruses with respect to severity and contagiousness.
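The two segmentation approaches and the parameters named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sequence, the use of the natural logarithm for entropy, the dropping of an incomplete trailing chunk, and the choice to label an empty segment between two adjacent "spaces" as "X" with length zero are all assumptions made for illustration.

```python
from collections import Counter
import math

EMPTY = "X"  # assumed label for an empty segment between two adjacent "spaces"

def fixed_chunks(genome, size=4):
    """Approach 1: non-overlapping chunks of equal length (incomplete tail dropped)."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]

def split_by_space(genome, space=None):
    """Approach 2: split the genome at a 'space' nucleotide.

    If no space is given, the most frequent nucleotide is used.
    """
    if space is None:
        space = Counter(genome).most_common(1)[0][0]
    return [word if word else EMPTY for word in genome.split(space)]

def rank_frequency(tokens):
    """Return (token, count) pairs ordered by decreasing frequency (rank 1 first)."""
    return Counter(tokens).most_common()

def summary(tokens):
    """Entropy (natural log), mean token length, and variance of token length."""
    n = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    lengths = [0 if t == EMPTY else len(t) for t in tokens]
    mean = sum(lengths) / n
    variance = sum((length - mean) ** 2 for length in lengths) / n
    return entropy, mean, variance

# Toy sequence, invented for illustration only (not a real viral genome).
toy = "AUGGCAUUACGAAUCG"
chunks = fixed_chunks(toy)
words = split_by_space(toy)
entropy, mean_len, var_len = summary(words)
```

For the toy sequence, adenine is the most frequent nucleotide, so it acts as the "space"; the adjacent pair "AA" and the leading "A" each produce an empty segment, which is why "X" takes the first rank in this small example.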

Highlights

  • Studies of genomes based on linguistic approaches date a few decades back (Brendel et al 1986; Pevzner et al 1989; Searls 1992; Botstein and Cherry 1997; Gimona 2006; Faltynek et al 2019; Ji 2020)

  • We observed that the fifth most frequent nucleotide sequences obtained within the second approach differ in nature among human coronaviruses

  • Looking in detail at the rank–frequency distributions of coronaviruses, we discovered the following pattern: the first rank is always occupied by “X”, followed by three single-nucleotide “words” at ranks 2–4, while the fifth rank is occupied by a two-nucleotide sequence with either the same (4-same) or different (4-diff) nucleotides, see Table 4
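The check on the fifth rank described in the last highlight can be sketched as a small Python helper. The function name and the example word lists are hypothetical; only the labels "4-same" and "4-diff" come from the text, and the helper simply returns None when the fifth-ranked item is not a two-nucleotide sequence.

```python
def fifth_rank_type(ranked_words):
    """Label the fifth-ranked word as '4-same' or '4-diff'.

    ranked_words: words ordered by decreasing frequency (rank 1 first).
    Returns None if there is no fifth rank or it is not a two-nucleotide word.
    """
    if len(ranked_words) < 5:
        return None
    word = ranked_words[4]
    if len(word) != 2:
        return None
    return "4-same" if word[0] == word[1] else "4-diff"

# Invented rankings for illustration only; they are not taken from Table 4.
example_same = ["X", "U", "G", "C", "UU"]
example_diff = ["X", "U", "G", "C", "UG"]
label_same = fifth_rank_type(example_same)
label_diff = fifth_rank_type(example_diff)
```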


Summary

Introduction

Studies of genomes based on linguistic approaches date a few decades back (Brendel et al 1986; Pevzner et al 1989; Searls 1992; Botstein and Cherry 1997; Gimona 2006; Faltynek et al 2019; Ji 2020). Various types of sequences in genomes are related to multiple genetic codes (Trifonov et al 2012) and can be studied both from a quantitative linguistic point of view (Ferrer-i-Cancho et al 2013; Ferrer-i-Cancho et al 2014) and from a wider perspective, within more abstract approaches (Neuman and Nave 2008; Barbieri 2012). Neural networks and deep learning algorithms have emerged as new tools to analyze nucleotide sequences (Fang et al 2019; Singh et al 2019; Melkus et al 2020; Ren et al 2020), offering wider prospects for studies of genomes. Viruses, balancing on the fuzzy border between the non-living and the living and remaining on the verge of life (Villarreal 2004; Kolb 2007; Carsetti 2020), are among the most interesting subjects of study.

