Abstract
Since the identification of SARS-CoV-2, a large number of genomes have been sequenced with unprecedented speed around the world. This marks a unique opportunity to analyze virus spreading and evolution in a worldwide context. Currently, there is no useful haplotype description for tracking important, globally scattered mutations. In addition, differences in the number of sequenced genomes between countries and/or months make it difficult to identify the emergence of haplotypes in regions where few genomes are sequenced but a large number of cases are reported. We propose an approach that normalizes the relative frequencies of mutations by COVID-19 cases, using all the available data, to identify major haplotypes. A similar normalization approach can also be used to track the temporal and geographic distribution of haplotypes in the world. Using 171,461 genomes, we identify five major haplotypes or operational taxonomic units (OTUs) based on nine high-frequency mutations. OTU_3, characterized by mutations R203K and G204R, is currently the most frequent haplotype circulating in four of the six continents analyzed (South America, North America, Europe, Asia, Africa, and Oceania). On the other hand, during almost all months analyzed, OTU_5, characterized by the mutation T85I in nsp2, is the most frequent in North America. Recently (since September), OTU_2 has become established as the most frequent in Europe. OTU_1, the ancestor haplotype, is near extinction, as shown by its low number of isolations since May. We also analyzed whether age, gender, or patient status is more related to a specific OTU. We did not find that any OTU shows a preference for a particular age group, gender, or patient status. Finally, we discuss structural and functional hypotheses for the most frequently identified mutations; none of these mutations shows a clear effect on transmissibility or pathogenicity.
Highlights
Organization on March 11th, 2020 (Cucinotta and Vanelli, 2020), with around 71 million cases and 1.6 million deaths around the world (December 14th, 2020; WHO, 2020), quickly becoming the most important health concern in the world
Subalignments corresponding to genomes divided by continent-month combinations were extracted, and relative frequencies of each base or gap in each genomic position were calculated (RF_{p,m−c}) using a Python script.
These relative frequencies were multiplied by the number of cases reported in the respective continent-month combination (CN_{m−c}).
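The two steps above (per-position relative frequencies within a continent-month subalignment, then multiplication by that combination's case count) can be sketched in Python as follows. This is a minimal illustration and not the authors' script: the function names, the toy sequences, and the case counts are all hypothetical. The last function, which sums the case-weighted values across combinations and divides by the total number of cases, is an assumption about how the normalization might be completed; the text above does not spell out that step.

```python
from collections import Counter


def relative_frequencies(subalignment):
    """Relative frequency of each base or gap at every alignment column.

    `subalignment` is a list of equal-length aligned sequences (strings)
    belonging to one continent-month combination.
    """
    if not subalignment:
        return []
    length = len(subalignment[0])
    if any(len(seq) != length for seq in subalignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(subalignment)
    # zip(*...) iterates over alignment columns (genomic positions).
    return [
        {state: count / n_seqs for state, count in Counter(column).items()}
        for column in zip(*subalignment)
    ]


def case_weighted(freqs, cases):
    """Multiply every relative frequency by the reported case count."""
    return [{state: rf * cases for state, rf in col.items()} for col in freqs]


def normalized_frequencies(groups):
    """ASSUMED final step: sum weighted values over groups, divide by total cases.

    `groups` maps a (continent, month) key to (subalignment, cases).
    All subalignments are assumed to come from the same global alignment.
    """
    total_cases = sum(cases for _, cases in groups.values())
    summed = None
    for subalignment, cases in groups.values():
        weighted = case_weighted(relative_frequencies(subalignment), cases)
        if summed is None:
            summed = [dict() for _ in weighted]
        for pos, col in enumerate(weighted):
            for state, value in col.items():
                summed[pos][state] = summed[pos].get(state, 0.0) + value
    return [
        {state: value / total_cases for state, value in col.items()}
        for col in (summed or [])
    ]


# Hypothetical toy data: two continent-month combinations.
toy_groups = {
    ("Europe", "2020-03"): (["ACGT", "ACGT", "ATGT", "A-GT"], 1000),
    ("Asia", "2020-03"): (["ATGT", "ATGT"], 3000),
}
toy_normalized = normalized_frequencies(toy_groups)
```

In the toy data, the variant `T` at the second column appears in only three of the six sequences, but most of the reported cases fall in the combination where it is fixed, so its case-normalized frequency is higher than its raw frequency among the sequenced genomes. This is the kind of sampling imbalance the weighting is meant to correct.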
Summary
Organization on March 11th, 2020 (Cucinotta and Vanelli, 2020), with around 71 million cases and 1.6 million deaths around the world (December 14th, 2020; WHO, 2020), quickly becoming the most important health concern in the world. Since the publication on January 24th, 2020 of the first complete genome sequence of SARS-CoV-2 from China (Zhu et al, 2020), thousands of genomes have been sequenced in a great number of countries on all six continents and made available in several databases. This marks a milestone in scientific history and gives us an unprecedented opportunity to study how a specific virus evolves in a worldwide context. In March 2020, two major lineages were proposed based on positions 8,782 and 28,144 using a data set of 103 genomes (Tang et al, 2020), which was followed by an interesting proposal that identified the same major lineages (named A and B) and other sublineages (Rambaut et al, 2020).