Abstract

To confront the global threat of coronavirus disease 2019, a massive number of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequences have been decoded, with the results promptly released through the GISAID database. Based on variant types, eight clades have already been defined in GISAID, but the diversity can be far greater. Owing to the explosive increase in available sequences, it is important to develop new technologies that can easily grasp the whole picture of the big-sequence data and support efficient knowledge discovery. An ability to efficiently clarify the detailed time-series changes in genome-wide mutation patterns will enable us to promptly identify and characterize dangerous variants that rapidly increase their population frequency. Here, we collectively analyzed over 150,000 SARS-CoV-2 genomes to understand their overall features and time-dependent changes using a batch-learning self-organizing map (BLSOM) for oligonucleotide composition, which is an unsupervised machine learning method. BLSOM can separate clades defined by GISAID with high precision, and each clade is subdivided into clusters, which shows a differential increase/decrease pattern based on geographic region and time. This allowed us to identify prevalent strains in each region and to show the commonality and diversity of the prevalent strains. Comprehensive characterization of the oligonucleotide composition of SARS-CoV-2 and elucidation of time-series trends of the population frequency of variants can clarify the viral adaptation processes after invasion into the human population and the time-dependent trend of prevalent epidemic strains across various regions, such as continents.

Highlights

  • The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread rampantly worldwide since it was first reported in December 2019, and the momentum of its spread is still increasing (WHO. 2020)

  • Considering the clade-independent tendency primarily caused by the APOBEC enzymes (Simmonds 2020), we performed batch-learning self-organizing map (BLSOM) analysis of the pentanucleotide composition and their odds ratio, which can reduce the effects caused by changes in mononucleotide composition

  • BLSOM is equipped with various tools that allow us to visualize the analysis results in an understandable way and to visualize differences in the number of subcluster sequences among continents (Figure 2) and their timeseries changes (Figure 3), i.e., the distinct variations in the resulting subclusters depending on the region and the collection time

Read more

Summary

Introduction

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread rampantly worldwide since it was first reported in December 2019, and the momentum of its spread is still increasing (WHO. 2020). It is imperative to develop a sequence alignment-free method that will enable us to grasp the whole picture of the big-sequence data and support efficient knowledge discovery from them. By focusing on the frequency of short oligonucleotides (e.g., tetra- and penta-nucleotides) in a large number of genomic fragments (e.g., 10 kb) derived from a wide variety of species, we have developed an unsupervised explainable AI (batch-learning self-organizing map; BLSOM), which enables separation (self-organization) of the genomic sequences by species and phylogeny and explains the causes that contribute to this separation (Abe et al 2003 ). In the analysis of genomic fragments of a wide range of microbial genomes, over 5 million sequences can be separated by phylogenetic groups with high accuracy (Abe et al 2020)

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.