Abstract
Despite its clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. We use comparative genomics to provide a high-confidence protein-coding gene set, characterize evolutionary constraint, and prioritize functional mutations. We select 44 Sarbecovirus genomes at ideally-suited evolutionary distances, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for ORFs 3a, 6, 7a, 7b, 8, 9b, and a novel alternate-frame gene, ORF3c, whereas ORFs 2b, 3d/3d-2, 3b, 9c, and 10 lack protein-coding signatures or convincing experimental evidence of protein-coding function. Furthermore, we show no other conserved protein-coding genes remain to be discovered. Mutation analysis suggests ORF8 contributes to within-individual fitness but not person-to-person transmission. Cross-strain and within-strain evolutionary pressures agree, except for fewer-than-expected within-strain mutations in nsp3 and S1, and more-than-expected in nucleocapsid, which shows a cluster of mutations in a predicted B-cell epitope, suggesting immune-avoidance selection. Evolutionary histories of residues disrupted by spike-protein substitutions D614G, N501Y, E484K, and K417N/T provide clues about their biology, and we catalog likely-functional co-inherited mutations. Previously reported RNA-modification sites show no enrichment for conservation. Here we report a high-confidence gene set and evolutionary-history annotations providing valuable resources and insights on SARS-CoV-2 biology, mutations, and evolution.
Highlights
Despite its clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology
To resolve the SARS-CoV-2 protein-coding gene set, we first need to clarify what we mean by open reading frame (ORF) and protein-coding gene, since different authors use these terms with slightly different meanings
We do not require an ORF to be translated or exceed any minimum length. It is standard in the bioinformatics community to define ORF in a way that does not require evidence of translation, though this definition might be less familiar in the virological community
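To make the length-agnostic, translation-agnostic sense of "ORF" concrete, the following minimal Python sketch enumerates ORFs in the three forward reading frames of a sequence. It is an illustration only, not the authors' pipeline. It assumes that an ORF starts at an ATG codon and ends at the first in-frame stop codon, and it scans only the given strand; neither assumption is stated in the text above, and the function name `find_orfs` is hypothetical.

```python
# Illustrative sketch only: the start-codon and stop-codon conventions below
# are assumptions, not definitions taken from the paper.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, start_codon="ATG"):
    """Return (start, end, frame) for each forward-strand ORF in seq.

    start is the 0-based index of the start codon; end is exclusive and
    includes the stop codon. No minimum length is imposed and no evidence
    of translation is required.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i  # first start codon since the previous stop
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Under this definition even `find_orfs("ATGTAA")` reports an ORF, a start codon followed immediately by a stop codon. Deciding whether such an ORF is a functional protein-coding gene requires separate evidence, such as the evolutionary signatures described in the abstract.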
Summary
The SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. We use comparative genomics to provide a high-confidence protein-coding gene set, characterize evolutionary constraint, and prioritize functional mutations. The last third of the genome encodes four named proteins that are present in all coronaviruses, namely S (spike surface glycoprotein), composed of S1 (viral attachment to host-cell ACE2 receptor) and S2 (membrane fusion, viral entry), E (envelope protein), M (membrane glycoprotein), and N (nucleocapsid phosphoprotein, RNA genome packaging). Their host-cell translation requires subgenomic RNAs of varying lengths, such that each functional ORF is first (or early) on its own transcript[8]. These subgenomic RNAs result from synthesis of negative-sense intermediates by transcription starting from the 3′ end of the genomic RNA, extending to one of several internal transcription-regulatory sequences (TRS), and looping to a common 5′ leader sequence; the negative-sense intermediates are used as templates for synthesis of positive-sense subgenomic RNAs[3,9].