Abstract

The emergence and establishment of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest (VOIs) and variants of concern (VOCs) highlight the importance of genomic surveillance. We propose a statistical learning strategy (SLS) for identifying and spatiotemporally tracking potentially relevant Spike protein mutations. We analyzed 167,893 Spike protein sequences from coronavirus disease 2019 (COVID-19) cases in the United States (excluding 21,391 sequences from VOI/VOC strains) deposited at GISAID from 19 January 2020 to 15 March 2021. Alignment against the reference Spike protein sequence led to the identification of viral residue variants (VRVs), i.e., residues harboring a substitution compared to the reference strain. Next, generalized additive models were applied to model VRV temporal dynamics and to identify VRVs with significant and substantial dynamics (false discovery rate q-value < 0.01; maximum VRV proportion > 10% on at least one day). Unsupervised learning was then applied to hierarchically organize VRVs by spatiotemporal patterns and identify VRV-haplotypes. Finally, homology modeling was performed to gain insight into the potential impact of VRVs on Spike protein structure. We identified 90 VRVs, 71 of which had not previously been observed in a VOI/VOC, and 35 of which have emerged recently and are durably present. Our analysis identified 17 VRVs approximately 91 days earlier than their first corresponding VOI/VOC publication. Unsupervised learning revealed eight VRV-haplotypes of four or more VRVs, suggesting two emerging strains (B.1.1.222 and B.1.234). Structural modeling supported a potential functional impact of the D1118H and L452R mutations. The SLS approach monitors all Spike residues equally over time, independently of existing phylogenetic classifications, and is complementary to existing genomic surveillance methods.
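The two selection criteria stated in the abstract (q-value < 0.01 and a maximum daily proportion above 10%) can be illustrated with a minimal Python sketch. It assumes the q-values come from a Benjamini-Hochberg adjustment of per-VRV p-values; the abstract only says "false discovery rate q-value", so that choice is an assumption, as are the function names, the VRV labels, and the toy numbers, none of which come from the study.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def select_vrvs(pvalues, daily_proportions, q_max=0.01, prop_min=0.10):
    """Keep VRVs with q-value < q_max and a proportion > prop_min on at least one day."""
    names = list(pvalues)
    qvals = bh_qvalues([pvalues[n] for n in names])
    return [
        n for n, q in zip(names, qvals)
        if q < q_max and max(daily_proportions[n]) > prop_min
    ]


# Hypothetical inputs (made-up values): one p-value per VRV from a temporal
# model, and the observed daily proportion of sequences carrying that VRV.
pvalues = {"vrv_a": 1e-6, "vrv_b": 2e-4, "vrv_c": 0.30}
daily_proportions = {
    "vrv_a": [0.00, 0.05, 0.22],  # significant and substantial
    "vrv_b": [0.01, 0.03, 0.08],  # significant but never above 10%
    "vrv_c": [0.12, 0.11, 0.10],  # above 10% but not significant
}
selected = select_vrvs(pvalues, daily_proportions)
print(selected)
```

In this toy example only `vrv_a` passes both filters, which shows why the two criteria are combined: statistical significance alone would retain rare variants, and the proportion threshold alone would retain variants without a significant temporal trend.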

Highlights

  • Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the pathogen responsible for the global coronavirus disease 2019 (COVID-19) pandemic, is an RNA virus and prone to replication errors [1]

  • We describe a statistical learning strategy (SLS) using generalized additive models, unsupervised learning techniques, and single nucleotide polymorphism (SNP) methodologies for identifying and spatiotemporally characterizing viral residue variants (VRVs), a term we use to describe amino acid (AA) positions in the Spike protein where a mutation is significantly present in a given geographic area

  • The decision to compartmentalize VRV discovery by state/territory was partially based on the fact that domestic travel restrictions have varied over the course of the pandemic, with nearly half of all states having imposed some type of interstate travel restriction [21], leading to the hypothesis that VRVs may follow state/territory-specific temporal dynamics
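As a rough illustration of state-level VRV discovery, the sketch below compares aligned Spike sequences with a reference and computes, for each state and residue position, the daily proportion of sequences carrying a substitution. It is a simplified stand-in rather than the authors' pipeline: the record layout `(state, day, aligned_sequence)`, the function names, and the five-residue toy sequences are all assumptions made for the example, and sequences are assumed to be already aligned to the same length as the reference.

```python
from collections import defaultdict


def substitutions(reference, aligned):
    """Return (position, ref_aa, alt_aa) for each substituted residue (1-based)."""
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, aligned), start=1)
        if alt != ref and alt not in "-X"  # skip gaps and ambiguous residues
    ]


def daily_vrv_proportions(reference, records):
    """records: iterable of (state, day, aligned_sequence) tuples.

    Returns {(state, position): {day: proportion of that state's sequences
    on that day with a substitution at that position}}. Days on which a
    position has no substitution are simply absent from the inner dict.
    """
    totals = defaultdict(int)
    variant_counts = defaultdict(lambda: defaultdict(int))
    for state, day, seq in records:
        totals[(state, day)] += 1
        for pos, _ref, _alt in substitutions(reference, seq):
            variant_counts[(state, pos)][day] += 1
    return {
        key: {day: n / totals[(key[0], day)] for day, n in by_day.items()}
        for key, by_day in variant_counts.items()
    }


# Toy data (hypothetical): a 5-residue "reference" and three aligned sequences.
reference = "MFVFL"
records = [
    ("CA", "2021-01-01", "MFVFL"),
    ("CA", "2021-01-01", "MFIFL"),  # substitution at position 3
    ("NY", "2021-01-01", "MFVFL"),
]
props = daily_vrv_proportions(reference, records)
print(props)
```

Keying the counts by state keeps each state's time series separate, so a temporal model can then be fitted per state and per residue position.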

Summary

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the pathogen responsible for the global coronavirus disease 2019 (COVID-19) pandemic, is an RNA virus and prone to replication errors [1]. For SARS-CoV-2, there are four classes of variants: variant being monitored (VBM), variant of interest (VOI), variant of concern (VOC), and variant of high consequence (VOHC). VOCs show specific attributes such as increased transmissibility [3,4,5,6], increased resistance to neutralization by antibodies elicited through natural infection [3,7,8,9], and/or increased resistance to neutralization by vaccine-elicited antibodies [8,10,11]. They have already influenced vaccine development, as evidenced by the current planning of clinical trials to test variant-adapted vaccines [12]. The identification of VOHCs could necessitate the introduction of more stringent public health guidelines and/or spur further treatment and vaccine development.
