Abstract

The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Here, we predict the mutability of all positions in SARS-CoV-2 protein domains to forecast the appearance of unseen variants. Using sequence data from other coronaviruses that predate SARS-CoV-2, we build statistical models that capture not only amino acid conservation but also more complex patterns resulting from epistasis. We show that these models are notably superior to conservation profiles in estimating the already observable SARS-CoV-2 variability. In the receptor binding domain of the spike protein, we observe that the predicted mutability correlates well with experimental measures of protein stability and that both are reliable mutability predictors (area under the receiver operating characteristic curve ∼0.8). Most interestingly, we observe an increasing agreement between our model and the observed variability as more data become available over time, proving the anticipatory capacity of our model. When combined with data concerning the immune response, our approach identifies positions where current variants of concern are highly overrepresented. These results could assist studies on viral evolution and future viral outbreaks and, in particular, guide the exploration and anticipation of potentially harmful future SARS-CoV-2 variants.
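The abstract reports predictor quality as an area under the receiver operating characteristic curve (AUC) of about 0.8. As a minimal sketch of what that number measures, the snippet below computes the AUC as the probability that a randomly chosen variable position receives a higher predicted mutability than a randomly chosen conserved one. The function name `roc_auc` and the toy scores and labels are assumptions for illustration only; they are not the authors' code or data.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive item has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): predicted mutability per position, and
# whether the position was observed to vary (1) or not (0).
predicted = [0.9, 0.8, 0.3, 0.1]
observed = [1, 0, 1, 0]
auc = roc_auc(predicted, observed)
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one, so a value near 0.8 means that a variable position is ranked above a conserved one in roughly four out of five such pairs.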

Highlights

  • The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions

  • One of the most important lessons of these studies is the importance of epistasis, i.e., the dependence of mutational effects on other preexisting mutations: epistatic models significantly outperform simpler nonepistatic modeling approaches based on independent conservation patterns of individual residue positions

  • According to the Pfam protein-domain family database [28], the SARS-CoV-2 proteome contains 39 protein domains covering 81% (7,860 out of 9,748 residues) of the entire proteome. For each of these domains, we predict the mutability using both the epistatic direct coupling analysis (DCA) and the independent-site (IND) models, following the general scheme illustrated in Fig. 1 and detailed in Materials and Methods
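To make the nonepistatic baseline mentioned in the highlights concrete, here is a minimal sketch of an independent-site (profile) model estimated from a multiple sequence alignment. It is an illustration under stated assumptions, not the authors' implementation: the pseudocount value, the log-odds score, and the names `site_frequencies`, `mutation_score`, and `column_entropy` are all hypothetical. An epistatic DCA model would add pairwise coupling terms between positions, which this sketch deliberately omits.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def site_frequencies(msa, pseudocount=1.0):
    """Per-column residue frequencies with additive pseudocounts.
    Symbols outside ALPHABET (e.g. 'X') are ignored."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa)
        total = sum(counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
        freqs.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return freqs


def mutation_score(freqs, wildtype, pos, mutant):
    """Log-odds of the mutant versus the wild-type residue at one position.
    Higher values mean the profile considers the substitution more tolerable."""
    return math.log(freqs[pos][mutant] / freqs[pos][wildtype[pos]])


def column_entropy(freqs, pos):
    """Shannon entropy (in nats) of a column: low for conserved positions."""
    return -sum(p * math.log(p) for p in freqs[pos].values())


# Toy alignment (made-up sequences, three columns).
toy_msa = ["ACD", "ACE", "ACD", "GCD"]
toy_freqs = site_frequencies(toy_msa)
score_seen = mutation_score(toy_freqs, "ACD", 0, "G")    # G occurs in column 0
score_unseen = mutation_score(toy_freqs, "ACD", 0, "W")  # W never occurs
```

Because each column is scored on its own, this model cannot express that the effect of a mutation depends on residues elsewhere in the sequence, which is the limitation the epistatic approach addresses.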


Introduction

The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Data-driven models trained on sequence data of patients affected by HIV have been used in this spirit. They identify regions that are subject to strong selective constraints and are less likely to vary [15, 16], guiding immunogen design for therapeutic strategies that are effective against current and future HIV strains [17, 18]. Such approaches are trained on large amounts of HIV sequence data, resulting from decades of study and high rates of intrapatient evolution [19]. The exposed regions of the spike protein have accumulated a large

