Approaches to Defining the “Hate Element” of a Behavior: A Data-Driven Typology
This article addresses the proliferation of definitions and approaches used to characterize the hate element in behaviors motivated by hate, including hate crimes, hate speech, and behaviors motivated by prejudice against specific identities (e.g., homophobia, anti-Semitism, Islamophobia), and investigates whether these definitions cluster into distinct types. Using machine learning, we clustered 423 definitions from academic and gray literature in five languages between 1990 and 2021, based on 16 theoretically derived categories. The resulting typology captures the diversity of definitions from ten countries in North America, Europe, and Oceania, providing a comprehensive framework for understanding how the hate element is conceptualized in these contexts. The findings offer a basis for future research and may help inform policy responses to hate-motivated behaviors.
- Research Article
18
- 10.1002/cl2.1397
- Apr 28, 2024
- Campbell systematic reviews
Mapping the scientific knowledge and approaches to defining and measuring hate crime, hate speech, and hate incidents: A systematic review.
- Research Article
74
- 10.1109/access.2020.2968173
- Jan 1, 2020
- IEEE Access
In recent times, South Africa has witnessed a surge of offensive and hate speech along racial and ethnic lines on Twitter. English is popular among the South African languages used. Although machine learning has been successfully used to detect offensive and hate speech in several English contexts, the distinctiveness of South African tweets and the similarities among offensive, hate and free speech require a domain-specific English corpus and techniques to detect offensive and hate speech. Thus, we developed an English corpus from South African tweets and evaluated different machine learning techniques to detect offensive and hate speech. Character n-gram, word n-gram, negative sentiment and syntactic-based features, and their hybrids, were extracted and analyzed using hyper-parameter optimization, ensemble and multi-tier meta-learning models of support vector machine, logistic regression, random forest and gradient boosting algorithms. The results showed that optimized support vector machine with character n-gram performed best in detection of hate speech with true positive rate of 0.894, while optimized gradient boosting with word n-gram performed best in detection of hate speech with true positive rate of 0.867. However, their performances in detection of other threatening classes were poor. Multi-tier meta-learning models achieved the most consistent and balanced classification performance, with true positive rates of 0.858 and 0.887 for hate speech and offensive speech, respectively, as well as a true positive rate of 0.646 for free speech and an overall accuracy of 0.671. The error analysis showed that the multi-tier meta-learning model could reduce the misclassification error rate of the optimized models by 34.26%.
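The abstract above reports per-class true positive rates alongside overall accuracy. As a rough illustration of how those two metrics relate, here is a minimal sketch in plain Python. The function names and the toy labels are hypothetical and are not taken from the study; `char_ngrams` only illustrates what a character n-gram feature is.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def true_positive_rates(y_true, y_pred):
    """Per-class TPR (recall): correct predictions for a class / true members of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def accuracy(y_true, y_pred):
    """Share of all items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels, not data from the study.
y_true = ["hate", "hate", "offensive", "free", "free", "free"]
y_pred = ["hate", "offensive", "offensive", "free", "hate", "free"]
rates = true_positive_rates(y_true, y_pred)
overall = accuracy(y_true, y_pred)
```

Because accuracy pools all classes while the true positive rate is computed per class, a model can score well on one class and still have a modest overall accuracy, which is the pattern the abstract describes.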
- Conference Article
28
- 10.1109/icter48817.2019.9023655
- Sep 1, 2019
With the rapid growth of information technology and computer science, communicating and presenting ideologies has become easier than in earlier decades. Since social media are available globally through the web, anyone can easily target a person or a group who belongs to a different culture or holds a different belief. Though everyone has a right to express his or her own ideas, that expression should not be harmful, as everyone has a right to be protected from any kind of hate speech. On social media, there are no automatic methods to detect hate speech, so anyone can easily be targeted. Since social media service providers do not have good linguistic knowledge of some languages such as Sinhala, they may take a couple of days to remove hate-related comments once they notice them. Therefore, hate speech detection in the Sinhala language is urgent and important work. We propose lexicon-based and machine-learning-based approaches to automatically detect Sinhala hate and offensive speech shared through social media. In our study, the lexicon-based approach began with a lexicon-generating process, and the corpus-based lexicon gave 76.3% accuracy for hate, offensive and neutral speech detection. The machine learning approach began with building a corpus of 3000 comments evenly distributed among hate, offensive and neutral speech. Using this comment corpus, we were able to identify the best-fitting feature groups and models for Sinhala hate speech detection. According to our experiments, character trigrams with Multinomial Naive Bayes gave the highest recall, 0.84, with 92.33% accuracy.
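To make the best-performing combination named above more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over character trigrams, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class name `TrigramNB`, the Laplace smoothing value, the space padding, and the English toy comments (used in place of Sinhala text) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of the lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNB:
    """Multinomial Naive Bayes over character trigrams with Laplace smoothing (alpha)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(trigrams(text))
        self.vocab = {g for counts in self.feature_counts.values() for g in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior of the class
            for g in trigrams(text):
                if g in self.vocab:  # trigrams never seen in training are ignored
                    score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy comments standing in for a labeled corpus.
texts = ["you are awful", "awful awful people", "have a nice day", "nice to see you"]
labels = ["offensive", "offensive", "neutral", "neutral"]
model = TrigramNB().fit(texts, labels)
```

Character trigrams need no word segmentation or stemming, which may be one reason they suit languages with limited tooling; that motivation is an inference here, not a claim made in the abstract.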
- Research Article
11
- 10.1111/add.16170
- Mar 7, 2023
- Addiction (Abingdon, England)
The most available data on the prevalence of cannabis use come from population surveys conducted in high-income countries in North America, Oceania and Europe. Less is known about the prevalence of cannabis use in Africa. This systematic review aimed to summarize general population-level cannabis use in sub-Saharan Africa since 2010. A comprehensive search was conducted in PubMed, EMBASE, PsycINFO and AJOL databases in addition to Global Health Data Exchange and grey literature without language restriction. Search terms related to 'substance', 'Substance-Related Disorders' and 'Prevalence' and 'Africa South of the Sahara' were used. Studies that reported cannabis use in the general population were selected, while studies from clinical populations and high-risk groups were excluded. Prevalence data on cannabis use in the general population of adolescents (10-17 years) and adults (≥ 18 years) in sub-Saharan Africa were extracted. The study included 53 studies for the quantitative meta-analysis and included 13 239 participants. Among adolescents, the life-time, 12-month and 6-month prevalence of cannabis use were 7.9% [95% confidence interval (CI) = 5.4-10.9%], 5.2% (95% CI = 1.7-10.3%) and 4.5% (95% CI = 3.3-5.8%), respectively. The corresponding life-time, 12-month and 6-month prevalence of cannabis use among adults were 12.6% (95% CI = 6.1-21.2%), 2.2% (95% CI = 1.7-2.7%, with data only available from Tanzania and Uganda) and 4.7% (95% CI = 3.3-6.4%), respectively. The male-to-female life-time cannabis use relative risk was 1.90 (95% CI = 1.25-2.98) among adolescents and 1.67 (CI = 0.63-4.39) among adults. Life-time cannabis use prevalence in sub-Saharan Africa appears to be approximately 12% for adults and just under 8% for adolescents.
- Research Article
57
- 10.1371/journal.pone.0057107
- Feb 20, 2013
- PLoS ONE
This study examines the regional and temporal differences in the statistical relationship between national-level carbon dioxide emissions and national-level population size. The authors analyze panel data from 1960 to 2005 for a diverse sample of nations, and employ descriptive statistics and rigorous panel regression modeling techniques. Initial descriptive analyses indicate that all regions experienced overall increases in carbon emissions and population size during the 45-year period of investigation, but with notable differences. For carbon emissions, the sample of countries in Asia experienced the largest percent increase, followed by countries in Latin America, Africa, and lastly the sample of relatively affluent countries in Europe, North America, and Oceania combined. For population size, the sample of countries in Africa experienced the largest percent increase, followed by countries in Latin America, Asia, and the combined sample of countries in Europe, North America, and Oceania. Findings for two-way fixed effects panel regression elasticity models of national-level carbon emissions indicate that the estimated elasticity coefficient for population size is much smaller for nations in Africa than for nations in other regions of the world. Regarding potential temporal changes, from 1960 to 2005 the estimated elasticity coefficient for population size decreased by 25% for the sample of African countries, 14% for the sample of Asian countries, 6.5% for the sample of Latin American countries, but remained the same in size for the sample of countries in Europe, North America, and Oceania. Overall, while population size continues to be the primary driver of total national-level anthropogenic carbon dioxide emissions, the findings for this study highlight the need for future research and policies to recognize that the actual impacts of population size on national-level carbon emissions differ across both time and region.
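For readers unfamiliar with the term, a two-way fixed effects elasticity model of the kind named above is commonly written in log-log form. The specification below is a generic sketch, not the authors' exact model: the symbols $E$, $P$, $\mathbf{x}$, $u_i$ and $w_t$ are notation assumed here, and the abstract does not list the control variables.

```latex
\ln E_{it} = \beta \,\ln P_{it} + \boldsymbol{\gamma}^{\top}\mathbf{x}_{it} + u_i + w_t + \varepsilon_{it},
\qquad
\beta = \frac{\partial \ln E_{it}}{\partial \ln P_{it}}
```

Here $E_{it}$ is carbon dioxide emissions and $P_{it}$ is population size for country $i$ in year $t$, $u_i$ is a country fixed effect, $w_t$ is a year fixed effect, and $\mathbf{x}_{it}$ stands for any additional controls. Under this form, $\beta$ is read as an elasticity: a 1% increase in population is associated with an approximately $\beta$% change in emissions, other terms held constant.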
- Research Article
1
- 10.32628/ijsrset2512312
- May 9, 2025
- International Journal of Scientific Research in Science, Engineering and Technology
Hate speech detection is a critical aspect of online content moderation, ensuring that digital platforms remain safe and inclusive. With the exponential rise of social media, harmful content such as hate speech and offensive language has increased, necessitating automated solutions for effective moderation. This project employs Natural Language Processing (NLP) and Machine Learning (ML) techniques to classify tweets into three categories: Hate Speech, Offensive Speech, and No Hate or Offensive Speech. By leveraging a Decision Tree Classifier, the system efficiently detects and categorizes harmful content while reducing manual intervention. The methodology involves data preprocessing, feature extraction using CountVectorizer, and training a classification model to achieve high accuracy. The proposed system overcomes the limitations of traditional keyword-based filtering by improving context awareness and scalability. The implementation is designed to process large volumes of data, making it highly suitable for real-world applications. This approach enhances digital safety, minimizes human effort in moderation, and ensures compliance with ethical standards. Future improvements may include the integration of deep learning models like LSTMs or Transformers and real-time social media API monitoring to enhance accuracy further. This project contributes to the growing need for robust and automated hate speech detection solutions in the digital era.
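The feature-extraction step mentioned above turns each tweet into a vector of word counts. The sketch below reproduces that idea in plain Python so the mechanics are visible; it is not scikit-learn's `CountVectorizer`, and the helper names `tokenize`, `fit_vocabulary` and `transform` as well as the toy documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({token for doc in docs for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def transform(docs, vocab):
    """Return one row of token counts per document, with columns ordered as in vocab."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows


# Hypothetical toy documents.
docs = ["good game good", "bad game"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
```

A classifier such as a decision tree would then be trained on rows of `matrix`. Note that pure counts discard word order, so any context awareness has to come from richer features or a different model.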
- Research Article
13
- 10.3389/fpubh.2023.952069
- Feb 7, 2023
- Frontiers in Public Health
On March 16, 2021, a white man shot and killed eight victims, six of whom were Asian women, at Atlanta-area spas and massage parlors. The aims of the study were to: (1) qualitatively summarize themes of tweets related to race, ethnicity, and racism immediately following the Atlanta spa shootings, and (2) examine temporal trends in expressions of hate speech and solidarity before and after the Atlanta spa shootings using a new methodology for hate speech analysis. A random 1% sample of publicly available tweets was collected from January to April 2021. The analytic sample included 708,933 tweets using race-related keywords. This sample was analyzed for hate speech using a newly developed method for combining faceted item response theory with deep learning to measure a continuum of hate speech, from solidarity race-related speech to use of violent, racist language. A qualitative content analysis was conducted on random samples of 1,000 tweets referencing Asians before the Atlanta spa shootings from January to March 15, 2021 and 2,000 tweets referencing Asians after the shooting from March 17 to 28 to capture the immediate reactions and discussions following the shootings. Qualitative themes that emerged included solidarity (4% before the shootings vs. 17% after), condemnation of the shootings (9% after), racism (10% before vs. 18% after), role of racist language during the pandemic (2 vs. 6%), intersectional vulnerabilities (4 vs. 6%), relationship between Asian and Black struggles against racism (5 vs. 7%), and discussions not related (74 vs. 37%). The quantitative hate speech model showed a decrease in the proportion of tweets referencing Asians that expressed racism (from 1.4% in the 7 days prior to the event to 1.0% in the 3 days after). The percent of tweets referencing Asians that expressed solidarity speech increased by 20% (from 22.7 to 27.2% during the same time period) (p < 0.001) and returned to its earlier rate within about 2 weeks.
Our analysis highlights some complexities of discrimination and the importance of nuanced evaluation of online speech. Findings suggest the importance of tracking hate and solidarity speech. By understanding the conversations emerging from social media, we may learn about possible ways to produce solidarity promoting messages and dampen hate messages.
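The abstract above reports a 20% increase in solidarity speech "from 22.7 to 27.2%", which is a relative change rather than a difference in percentage points. The short sketch below works through that arithmetic using the figures quoted in the abstract; the function names are hypothetical.

```python
def absolute_change(before, after):
    """Difference in percentage points between two percentages."""
    return after - before


def relative_change(before, after):
    """Change expressed as a fraction of the starting value."""
    return (after - before) / before


# Percentages quoted in the abstract.
solidarity_points = absolute_change(22.7, 27.2)   # about 4.5 percentage points
solidarity_relative = relative_change(22.7, 27.2)  # about 0.198, i.e. roughly 20%
racism_relative = relative_change(1.4, 1.0)        # about -0.286
```

Read this way, the 4.5-point rise in solidarity speech corresponds to the reported 20% relative increase, and the drop in racist speech from 1.4% to 1.0% is a relative decrease of a little under 29%.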
- Book Chapter
2
- 10.1007/978-981-16-9605-3_61
- Jan 1, 2022
Hate speech is not uncommon and is likely practiced on almost every networking platform. In recent times, due to the exponential increase in Internet users and events such as the unprecedented pandemic and lockdown, the usage of social platforms for communicating thoughts, opinions, and ideas has increased. Hate speech has a strong impact on people’s lives and is one of the reasons for suicidal events. There is certainly a strong need to make progress toward the mitigation of hate speech. Detection is the primary step to eradicate hate speech. This paper presents a comparative analysis of different machine learning algorithms for detecting hate speech. Data from the Twitter social platform was considered. From the analysis, it was found that the long short-term memory method is a highly performant machine learning algorithm.
Keywords: Classification, Hate speech, Machine learning, Natural language processing
- Research Article
- 10.4396/sfl2019es08
- Aug 18, 2020
This talk presents the first results of research conducted on a corpus of 7000 posts collected on the Reddit social network during the 2016 American presidential campaign. The research is the result of a collaboration between Berkeley D-Lab, who shared the corpus, LSI - CentraleSupelec and CUBE. Thanks to funding from the Anti-Defamation League, the corpus has been labeled to apply Machine Learning techniques: 400 posts have been labeled as “hate speech” by human analysts. Galofaro, Toffano and Doan applied to both sub-corpora (hate and non-hate speeches) an analysis technique inspired by Greimas’s structural semantics, Eco’s semiotics, and Quantum Information Retrieval (van Rijsbergen). Each text was formalized as a semantic network using the HAL technique. We then measured the semantic similarity between two key words formalized as two word-vectors with the classical measure of cosine-similarity and then compared it with the degree of quantum correlation between them measured with the Born rule. This correlation, linked to the co-occurrence of the word vectors in the same contexts, extracts from the latter useful information to characterize the considered semantic relationships (“presence of correlation”, “absence of correlation” or “presence of anti-correlation”). In this way, the new technique makes it possible to overcome some critical aspects of the Machine Learning techniques currently in use, being based on the meaning of the text and not on the way in which the human analyst labels the corpus.
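The two measures contrasted above can be illustrated on small word vectors. The sketch below computes the classical cosine similarity and, as a deliberately simplified reading of the Born rule, the squared overlap between the two normalized vectors. This simplification is an assumption made for illustration: the authors' actual quantum correlation measure may be defined differently, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two real-valued word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def born_probability(u, v):
    """Squared overlap of the normalized vectors (simplified, real-valued Born rule)."""
    return cosine_similarity(u, v) ** 2


# Hypothetical co-occurrence vectors for three words.
orthogonal = cosine_similarity([1, 0], [0, 1])
parallel = cosine_similarity([1, 2, 0], [2, 4, 0])
half = born_probability([1, 0], [1, 1])
```

In this simplified form the squared overlap discards the sign of the cosine, so it cannot by itself separate correlation from anti-correlation; the three-way distinction described in the abstract therefore relies on more than this sketch shows.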
- Research Article
- 10.7816/nesne-09-22-11
- Dec 31, 2021
- Nesne Psikoloji Dergisi
Hate crime and hate speech are extreme examples of negative intergroup relations. Analyzing the variables that lead up to hate speech and hate crimes, which have many physically and psychologically destructive consequences for members of the targeted groups, is thought to be very useful for dealing with them. Therefore, the aim of the present study is to address some of the social psychological variables associated with hate speech and hate crimes and to suggest solutions to reduce hate speech and hate crimes in this context. For this purpose, hate speech and hate crimes were first defined and various examples were presented. Hate crimes and hate speech were then examined in terms of social identity identification, social dominance orientation, system justification, realistic and symbolic threat perception, frustration and scapegoat concepts. The relationship of these concepts to hate speech and hate crimes has been illustrated with research findings and examples from various regions in Turkey and the world. Finally, some solution suggestions have been presented by making use of this theoretical knowledge in terms of combating hate crimes and hate speech. Keywords: Hate crime, hate speech, intergroup relations, social psychology
- Conference Article
- 10.1109/mlcss57186.2022.00060
- Aug 1, 2022
Online platforms and social media are very attractive to internet users. Platforms like YouTube, Twitter, Instagram, etc., are in high demand due to their services. Users of these sites frequently comment on others' posts, and these comments may contain toxic speech. Some platforms also raise concerns about the emergence of this activity. As the increase of hate speech is next to impossible to control, the need arises to detect this content through automated hate speech detection technologies. In this work, we focused on multi-lingual issues, especially Indo-European code-mixed languages. At first, we identified some issues related to code-mixed Indian languages. Then, we focused on the available solutions to this problem. We have gone through the works performed on machine learning and deep learning techniques and identified the limitations of those works. We have analyzed the present solutions and gone through comparative studies of them. Our implementation is conducted on code-mixed Twitter datasets providing several perspectives on hate speech. We have performed the experimental work on the HASOC 2021 dataset. Our work contributes to the field of hate speech detection by comparing feature extraction and classifier algorithms (Machine Learning and Deep Learning). More specifically, the proposed work aimed at distinguishing Hate and Non-Hate speech from normal text.
- Research Article
3
- 10.7717/peerj-cs.2138
- Jun 27, 2024
- PeerJ. Computer science
The recent rapid growth in the number of Saudi female athletes and sports enthusiasts' presence on social media has exposed them to gender-hate speech and discrimination. Hate speech, a harmful worldwide phenomenon, can have severe consequences. Its prevalence in sports has surged alongside the growing influence of social media, with X serving as a prominent platform for the expression of hate speech and discriminatory comments, often targeting women in sports. This research combines two studies that explore online hate speech and gender biases in the context of sports, proposing an automated solution for detecting hate speech targeting women in sports on platforms like X, with a particular focus on Arabic, a challenging domain with limited prior research. In Study 1, semi-structured interviews with 33 Saudi female athletes and sports fans revealed common forms of hate speech, including gender-based derogatory comments, misogyny, and appearance-related discrimination. Building upon the foundations laid by Study 1, Study 2 addresses the pressing need for effective interventions to combat hate speech against women in sports on social media by evaluating machine learning (ML) models for identifying hate speech targeting women in sports in Arabic. A dataset of 7,487 Arabic tweets was collected, annotated, and pre-processed. Term frequency-inverse document frequency (TF-IDF) and part-of-speech (POS) feature extraction techniques were used, and various ML algorithms were trained. Random Forest consistently outperformed the other methods, achieving accuracies of 85% and 84% using TF-IDF and POS, respectively, demonstrating the effectiveness of both feature sets in identifying Arabic hate speech. The research contribution advances the understanding of online hate targeting Arabic women in sports by identifying various forms of such hate.
The systematic creation of a meticulously annotated Arabic hate speech dataset, specifically focused on women's sports, enhances the dataset's reliability and provides valuable insights for future research in countering hate speech against women in sports. This dataset forms a strong foundation for developing effective strategies to address online hate within the unique context of women's sports. The research findings contribute to the ongoing efforts to combat hate speech against women in sports on social media, aligning with the objectives of Saudi Arabia's Vision 2030 and recognizing the significance of female participation in sports.
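As background for the TF-IDF features mentioned in this abstract, here is a minimal plain-Python sketch of the weighting. It uses the textbook form, term frequency times the natural log of (number of documents / document frequency), which is an assumption: libraries typically add smoothing and normalization, and the abstract does not say which variant was used. The function name and the English toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy documents standing in for annotated tweets.
docs = ["great match today", "terrible match"]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "match" here, receives a weight of zero, which is how TF-IDF downplays words that do not help separate classes.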
- Research Article
- 10.17159/obiter.v27i1.14430
- Jul 24, 2022
- Obiter
In 1996, the late Prof JMT Labuschagne wrote an article dealing with the limits of freedom of speech and hate speech (“Menseregtelike en Strafregtelike Bekamping van Groepsidentiteitmatige Krenking en Geweld” 1996 De Jure 23). He discussed freedom of expression and hate speech in the United States of America, various European countries, South Africa and also within the context of international law. He subsequently discussed the idea of updating his thoughts, taking into consideration the influence of the Constitution of the Republic of South Africa, 1996 and the Promotion of Equality and Prevention of Unfair Discrimination Act (4 of 2000, commonly referred to as the “Discrimination Act”). Sadly, he never got around to doing so. Since his 1996 article, much development has taken place in this field including the introduction of the 2004 Draft Prohibition of Hate Speech Bill. The events of 11 September 2001 in the USA and the 2005 bombings in London (and other similar attacks all over the world) have increased intolerance and suspicion between people from different races and religions manyfold. Immediately following the London bombings, it was reported that religious hate crime (that is, attacks targeting England’s Muslim community) had increased by nearly 600% (“Religious Hate Crime Up 600%” 2005-08-02 21:14 SA http://www.news24.com visited 2 Aug 2005). Hate speech is regarded as an exception to freedom of speech/expression.
The notion of freedom of expression has been discussed at length by various South African writers (Johannessen “A Critical View of the Constitutional Hate Speech Provision: Section 16” 1997 SAJHR 136; Devenish “Freedom of Expression: The ‘Marketplace’ of Ideas” 1995 TSAR 442; Carpenter “Fundamental Rights: Is There a Pecking Order?” 1995 Codicillus 27; Johannessen “Freedom of Expression and Information in the New South African Constitution and Its Compatibility with International Standards” 1995 SAJHR 216; Van Rooyen “Censorship in a Future South Africa: A Legal Perspective” 1994 De Jure 283; Nesser “Hate Speech in the New South Africa: Constitutional Considerations for a Land Recovering from Decades of Racial Repression and Violence” 1994 SAJHR 336; and Marcus “Freedom of Expression Under the Constitution” 1994 SAJHR 140). This note briefly touches on some aspects relating to freedom of expression and hate speech and also explores the (rather newly discovered) notion of hate crime. It asks the question whether there is any connection between hate speech and hate crime.
- Research Article
7
- 10.1002/cl2.1228
- Apr 18, 2022
- Campbell systematic reviews
The overall aim of the review is to map the definitions and measurement tools used to capture the whole spectrum of hate motivated behaviors, including hate crime, hate speech and hate incidents. This will benefit the field of hate studies by providing a baseline that can inform the building of cumulative knowledge and comparative research. The first review objective is to map definitions of hate crime, hate incidents, hate speech, and surrogate terms. Specific research questions underpinning this objective are: (a) How are hate crimes, hate speech and hate incidents defined in the academic, legal, policy, and programming literature?; (b) What are the concepts, parameters and criteria that qualify a behavior as being hate crime, hate incident or hate speech?; and (c) What are the most common concepts, parameters and criteria found across definitions? What are the differences between definitions and the elements they contain? The second review objective is to map the tools used to measure the prevalence of hate crime, hate incidents, hate speech, and surrogate terms. Specific research questions underpinning this objective are: (a) How are definitions operationalised to measure hate crimes, hate speech, and hate incidents?; and (b) How valid and reliable are these measures?
- Peer Review Report
- 10.7554/elife.82538.sa2
- Mar 5, 2023
Abstract
Although France was one of the most affected European countries by the COVID-19 pandemic in 2020, the dynamics of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) movement within France, but also involving France in Europe and in the world, remain only partially characterized in this timeframe. Here, we analyzed GISAID deposited sequences from January 1 to December 31, 2020 (n = 638,706 sequences at the time of writing). To tackle the challenging number of sequences without the bias of analyzing a single subsample of sequences, we produced 100 subsamples of sequences and related phylogenetic trees from the whole dataset for different geographic scales (worldwide, European countries, and French administrative regions) and time periods (from January 1 to July 25, 2020, and from July 26 to December 31, 2020). We applied a maximum likelihood discrete trait phylogeographic method to date exchange events (i.e., a transition from one location to another one), to estimate the geographic spread of SARS-CoV-2 transmissions and lineages into, from and within France, Europe, and the world. The results unraveled two different patterns of exchange events between the first and second half of 2020. Throughout the year, Europe was systematically associated with most of the intercontinental exchanges. SARS-CoV-2 was mainly introduced into France from North America and Europe (mostly by Italy, Spain, the United Kingdom, Belgium, and Germany) during the first European epidemic wave. During the second wave, exchange events were limited to neighboring countries without strong intercontinental movement, but Russia widely exported the virus into Europe during the summer of 2020.
France mostly exported B.1 and B.1.160 lineages, respectively, during the first and second European epidemic waves. At the level of French administrative regions, the Paris area was the main exporter during the first wave. But, for the second epidemic wave, it equally contributed to virus spread with Lyon area, the second most populated urban area after Paris in France. The main circulating lineages were similarly distributed among the French regions. To conclude, by enabling the inclusion of tens of thousands of viral sequences, this original phylodynamic method enabled us to robustly describe SARS-CoV-2 geographic spread through France, Europe, and worldwide in 2020.
Editor's evaluation
This paper is a comprehensive, quantitative, and robust overview of the global, European, and French genomic epidemiology of SARS-CoV-2 in the first year of the pandemic. It contributes methodological advances in maximum likelihood phylogeography, using multiple scales and providing a simulation-based validation. The results show two distinct patterns of SARS-CoV-2 exchange events between the first and second half of 2020, with Europe being involved in most intercontinental exchanges: France experienced viral introductions primarily from North America and Europe during the first wave, while the second wave saw limited intercontinental movement and a significant contribution of the virus from Russia into Europe. https://doi.org/10.7554/eLife.82538.sa0
Introduction
On December 1, 2019, an outbreak of severe respiratory disease was identified in the city of Wuhan, China (Huang et al., 2020). The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was rapidly identified as the agent of the disease (Zhu et al., 2020), responsible for the ongoing global pandemic of coronavirus disease 2019 (COVID-19).
By the end of 2020, the virus caused over 1.8 million deaths worldwide including ~65,000 deaths in France, concomitantly with social and economic devastations in many regions of the world (Mofijur et al., 2021; Santomauro et al., 2021). Since the beginning of COVID-19 pandemic, the scientific community has thoroughly characterized the virus, including its pathogenesis, the monitoring of its circulation in human populations, and the development of several treatments or vaccines (Cevik et al., 2020; Krammer, 2020). Epidemiological models have been particularly helpful to quantify viral spread both in the short and long terms and to inform public health decisions (Hoertel et al., 2020; Kissler et al., 2020). In addition to clinical and epidemiological insights, viral whole-genome sequencing has become a powerful and invaluable tool to better understand infection dynamics (Volz et al., 2013), including the COVID-19 pandemic. The number of available SARS-CoV-2 whole-genome sequences has rapidly grown thanks to the efforts of scientists and researchers gathered via international networks such as the Global Initiative on Sharing All Influenza Data, GISAID (https://www.gisaid.org/; Khare et al., 2021). These genomic sequences are essential to effectively reconstruct the global viral spread and the origins of variants. Genomic data have become a strong asset in addition to epidemiological data to inform governments and help public health decisions (Attwood et al., 2022; Rife et al., 2017). However, due to the computational time required for many analyses, existing phylogenetic tools are limited for studying large amounts of data such as those generated by widespread viral sequencing. Therefore, it is still necessary to develop methods to analyze large datasets while optimizing computational calculation times. Producing appropriate subsamples through several replicates may be an efficient approach in this matter. 
In France, the first COVID-19 suspected case was identified in late December 2019 (Deslandes et al., 2020), and the first confirmed cases of SARS-CoV-2 infection were detected on January 24, 2020, in individuals who had recently traveled in China (Bernard Stoecklin et al., 2020). COVID-19 cases remained scarce until the end of February, when the national incidence curve of new SARS-CoV-2 infections started to rise (Figure 1). By the end of February, reinforced measures were announced, including social distancing, cessation of passenger flights to France, school closure, and finally, a complete lockdown across the entire country from March 17 to May 10, 2020. The reported daily incidence and numbers of severe cases peaked at the beginning of April 2020 before decreasing steadily until August 2020. However, after the relaxation of social distancing measures in June, a second wave of infections occurred in early September peaking at more than 100,000 positive cases and 1300 confirmed deaths in a single day on November 2, 2020 (Figure 1). After this peak, daily incidence and severe COVID-19 cases gradually diminished down to a number of positive daily cases varying between 2000 and 25,000 at the end of 2020 thanks to a second national lockdown applied between October 29 and December 15, 2020. Epidemiological trends were similar in most European countries except for Russia or Romania, where high rates of SARS-CoV-2-related deaths were reported even in the summer of 2020. Of note, the other continents showed different patterns of virus circulation: compared to Europe, the number of deaths increased about 2 weeks later in North America and remained high throughout 2020; and from early May, Asia and South America were also highly impacted by the pandemic (Figure 1—figure supplement 1).
Figure 1. Timeline of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-related deaths and stringency index in France, 2020.
Key events are indicated on the timeline. Official lockdowns included stay-at-home orders and closure of schools and daycares. Based on SARS-CoV-2-related deaths, the first two French epidemic waves are dated from March to July 2020 and from September to December 2020, respectively. SARS-CoV-2-related deaths are displayed as the daily number of deaths (light blue area) and as the weekly average of the daily number of deaths (dark blue curve). The stringency (Oxford) index is a composite measure based on different response indicators, including school and workplace closures and travel bans, rescaled to a value from 0 to 100 (100 = strictest) (Hale et al., 2021).

Elucidating SARS-CoV-2 dynamics throughout the various phases of the pandemic is paramount to better anticipate how to limit virus circulation in future viral epidemics (Rife et al., 2017). Here, we analyzed GISAID deposited sequences to elucidate the origins and spread of the virus in France, Europe, and the world from January 1 to December 31, 2020. Through a maximum likelihood discrete trait phylogeographic method, we estimated the main geographical areas that contributed to viral introduction into France and Europe, the countries/continents to which France exported SARS-CoV-2 the most, and the contribution of the different French regions to the national circulation of the virus. The main exchanged lineages were also investigated. We looked at the differences in virus circulation during each of the two European epidemic waves of 2020 independently. Given France's central geographic location in Europe and the high proportion of international travelers visiting this country before the pandemic, we aimed to explore the role that France played in SARS-CoV-2 exchanges both in Europe and worldwide.

Results

Defining appropriate subsamples using simulations

From January 1 to December 31, 2020, a total of 638,706 sequences were retained in our study.
Inferring a phylogenetic tree with such a large number of sequences would require very long calculation times. To overcome this limitation, we constructed smaller datasets by randomly choosing subsamples (with replacement) of the sequences. The number of sequences for each country in each week was chosen to be proportional to the number of SARS-CoV-2-related deaths per country and per week, with a 2-week shift to account for the time between infection and death. As a proof of concept, we conducted an extensive simulation study to estimate the accuracy of the discrete trait phylogeographic inference for rates of transitions between two distinct locations. First, to evaluate the precision of such inference on a tree of 1000 leaves, we simulated a two-state model with different combinations of transition rates in 50 replicates. Parameters were correctly estimated with limited variability across the 50 replicates. The median parameters across replicates gave a very accurate estimation (Figure 2A).

Figure 2. Estimating variability in transition rates using simulations. (A) Estimated versus true parameters in the simulation study of the two-state model. The two panels show the two transition rates. For each set of parameters, 50 replicates were conducted. The large red dot is the median of the replicates. The red cross is the true parameter value, on the bisector. (B) Estimated rate of transition in subsampled trees. For each replicate (n = 50), one point is the result of one smaller phylogenetic tree subsampled from a large phylogenetic tree. The big dot shows the median for each replicate. The horizontal red line is the overall median (of the medians) across replicates. The horizontal dashed gray line is the true rate. Only one of the two rate parameters is shown.
(C) Log-median error in parameter estimation as a function of the log number of replicates, when inference is conducted on truly independent replicate evolutionary histories, on a tree of 1000 leaves. The points are the data; the dashed line has a slope of −1, which is the expectation when the replicates are truly independent. (D) Log-median error as a function of the log number of subsamples used for the inference done on subsampled phylogenetic trees. The colored points and lines show the inference done on 50 distinct realizations of the evolutionary process on the whole tree. The dashed line is the overall regression line with a slope of −0.7. We tested between 1 and 10 subsampled trees (x-axis).

Next, to evaluate how independent parameter estimates are when they are made on trees randomly subsampled from the same larger phylogeny, we inferred parameters on 50 trees of 100 leaves each, randomly subsampled from a 10,000-leaf SARS-CoV-2 phylogenetic tree. For each resulting subtree, we conducted inferences on 50 replicates corresponding to 50 realizations of the stochastic process of evolution of the discrete character (as done in the first simulation) on the whole tree of 10,000 leaves. For each replicate, we observed some error in the estimation of the parameter, because one replicate corresponds to only one possible realization of the evolutionary process, although the overall median of inferred parameters across subsampled trees was closer to the true parameter values (Figure 2B). Different estimations of the transition rates conducted on different subsampled trees are not expected to be fully independent because the subtrees partly share the same evolutionary history. Therefore, we estimated the level of independence of these estimations.
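One way to gauge such independence is to look at how quickly the error of an averaged estimate shrinks as more estimates are averaged. The sketch below is only an illustration with synthetic numbers, not the authors' simulation code: the function names, `true_rate`, the noise levels, and the idea of modeling shared evolutionary history as a common error term (`shared_sd`) are all assumptions, and error is measured here as squared deviation from the true value. Under these assumptions, fully independent estimates give a log-log slope near −1, while estimates that share part of their error give a shallower slope.

```python
import math
import random


def loglog_slope(ns, errors):
    """Least-squares slope of log(error) against log(N)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def mean_squared_error(n, shared_sd, own_sd, true_rate, trials, rng):
    """Mean squared error of the average of n estimates.

    Hypothetical error model: each estimate is the true rate, plus an error
    shared by all n estimates of a trial (standing in for shared history),
    plus an error of its own.
    """
    total = 0.0
    for _ in range(trials):
        shared = rng.gauss(0.0, shared_sd)
        estimates = [true_rate + shared + rng.gauss(0.0, own_sd) for _ in range(n)]
        average = sum(estimates) / n
        total += (average - true_rate) ** 2
    return total / trials


def scaling_slope(shared_sd, own_sd=1.0, true_rate=0.5, max_n=10, trials=2000, seed=1):
    """Slope of log(error) versus log(N) for N = 1..max_n averaged estimates."""
    rng = random.Random(seed)
    ns = list(range(1, max_n + 1))
    errors = [mean_squared_error(n, shared_sd, own_sd, true_rate, trials, rng) for n in ns]
    return loglog_slope(ns, errors)


slope_independent = scaling_slope(shared_sd=0.0)  # no shared error term
slope_partial = scaling_slope(shared_sd=0.3)      # some shared error term
print(round(slope_independent, 2), round(slope_partial, 2))
```

A slope between 0 and −1 in such a plot indicates partial independence: averaging more subsampled trees still reduces the error, but more slowly than averaging fully independent replicates would.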
When several estimates are perfectly independent from one another and are averaged to obtain the final estimate of the quantity of interest, we expect the error in parameter estimation to converge to 0 with a 1/N (N⁻¹) scaling, where N is the number of replicates. This is indeed what we observed when we calculated the error in the estimation of the parameter as a function of the chosen number of replicates N in the first set of simulations. Here, the replicates were truly independent realizations of the evolutionary history and inference was conducted on the whole tree of 1000 leaves (Figure 2C). In contrast, when estimates are perfectly dependent, the error in the averaged parameter estimate is not expected to decrease with N. When evaluating the error in parameter estimates across subsamples of the large tree, we expected the scaling of error as a function of the number of subsamples N to be intermediate between complete dependence (~N⁰ scaling) and perfect independence (~N⁻¹ scaling). Regressing log(error) on log(N), we estimated a slope of −0.7 (Figure 2D). Thus, inferences conducted on subsamples of the same phylogenetic tree are partly independent. The precise degree of independence is expected to depend on the shape of the phylogenetic tree, but the coefficient was similar when we repeated the study on a randomly generated tree instead of the SARS-CoV-2 tree.

We finally conducted another round of simulations to evaluate the error in the estimated exchanges between multiple locations when using sparse subsampling. For that, a 1,000,000-leaf tree was simulated with a five-state discrete trait representing geographical units. Then, 100 subsampled 1000-leaf trees were produced from the whole phylogenetic tree, and the ancestry for the discrete trait was reconstructed from the leaf data only.
We estimated the number of transitions (exchanges) of each type and compared them with those obtained from the main tree, finding a mean error rate of 2.7% over the 100 subsamples (Figure 2—figure supplement 1). Altogether, these simulations suggested that using subsamples of 1000 sequences from a large dataset and performing partially independent replicates is sufficient to accurately estimate transition events.

Description of the datasets and global diversity of SARS-CoV-2 sequences

We defined 100 subsamples of sequences proportionally to COVID-19 deaths across geographic locations and time, for different geographic scales (worldwide, Europe, and French regions) and time periods (from January 1 to July 25, 2020, and from July 26 to December 31, 2020, covering the first and second European epidemic waves, respectively). We chose the sampling intensity guided by the weekly number of SARS-CoV-2-related deaths reported by public health organizations. Here, the number of SARS-CoV-2-related deaths was used rather than the number of detected cases because the latter was biased by variable ascertainment rates across countries and time. For example, the larger number of PCR tests conducted in the second epidemic wave could wrongly suggest that the virus circulated much more during the second half of 2020 (Figure 1—figure supplement 2). For each geographic scale and time period, there was a positive correlation between the weekly number of SARS-CoV-2-related deaths and the weekly number of sequences we included in a subsample (Spearman's rank correlation, p < 0.001; r = 0.94 for the lowest correlation). We also confirmed that the number of sequences per territory was, on average, properly distributed over time within each time period (Figure 2—figure supplement 2). Some countries and French administrative regions were, however, excluded from the analyses because they were not sufficiently represented in the GISAID database.
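The death-proportional subsampling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `draw_subsample`, the toy `pool` and `deaths` dictionaries, and the location labels are hypothetical; sequences collected in week w are assumed to be weighted by deaths reported in week w + 2; and sequences are assumed to be drawn without repetition within a subsample while remaining available to other subsamples.

```python
import random


def draw_subsample(pool, deaths, target_size, shift_weeks=2, rng=None):
    """Draw one subsample with per-(location, week) counts proportional to deaths.

    pool:   {(location, week): [sequence ids available for that location and week]}
    deaths: {(location, week): number of reported deaths}
    Sequences from week w are weighted by deaths in week w + shift_weeks
    (an assumed reading of the 2-week shift between infection and death).
    """
    rng = rng or random.Random()
    weights = {
        (loc, week): deaths.get((loc, week + shift_weeks), 0)
        for (loc, week) in pool
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no deaths match the available sequences")
    sample = []
    for key, weight in weights.items():
        wanted = round(target_size * weight / total)
        take = min(wanted, len(pool[key]))  # cannot draw more than is available
        sample.extend(rng.sample(pool[key], take))
    return sample


# Toy data: two hypothetical locations, "A" and "B".
pool = {
    ("A", 1): [f"A_w1_{i}" for i in range(50)],
    ("A", 2): [f"A_w2_{i}" for i in range(50)],
    ("B", 1): [f"B_w1_{i}" for i in range(50)],
}
deaths = {("A", 3): 30, ("A", 4): 10, ("B", 3): 60}

# Replicate subsamples are drawn independently, so a sequence may recur across them.
subsamples = [draw_subsample(pool, deaths, 20, rng=random.Random(seed)) for seed in range(100)]
first = subsamples[0]
print(len(first), sum(s.startswith("B") for s in first))
```

When a location-week has fewer sequences than its target, this sketch simply takes what is available; the text instead excludes territories that are too sparsely sequenced to match their weekly deaths.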
Overall, a total of 39,288 and 39,755 distinct SARS-CoV-2 sequences were included across the 100 sampled phylogenies for the worldwide dataset, for the first and the second time periods, respectively (Table 1). At the European scale, 26,757 and 27,658 different SARS-CoV-2 sequences covering 11 countries were analyzed across the 100 subsamples (Table 1). For French administrative regions, the sequences available on the GISAID database were very sparse. Provence-Alpes-Côte d'Azur (PACA, Marseille area) was the only region that sequenced SARS-CoV-2 extensively in 2020. Île-de-France (IDF, Paris area), Auvergne-Rhône-Alpes (ARA, Lyon area), Occitanie (OCC, Toulouse and Montpellier area), and Bretagne (BRE, Rennes area) sequenced much less than PACA, but provided sufficient data to investigate SARS-CoV-2 geographic exchange events in France. The remaining French administrative regions were discarded since too few sequences were available to properly match the number of weekly SARS-CoV-2 deaths (Figure 2—figure supplement 3). We thus considered 2543 unique sequences across the 100 subsamples between January 1 and July 25, 2020, and 3124 unique sequences between July 26 and December 31, 2020 (Table 1).

Table 1. Number of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences investigated for each dataset.
| Dataset | Geographies | Period investigated | Average number of sequences sampled in a subsample | Total number of sequences |
| --- | --- | --- | --- | --- |
| World | Africa, Asia, Europe, France, North America, Oceania, South America | January 1 to July 25, 2020 | 846 | 39,288 |
| | | July 26 to December 31, 2020 | 777 | 39,755 |
| Europe | Belgium, France, Germany, Italy, The Netherlands, Poland, Romania, Russia, Spain, Sweden, United Kingdom | January 1 to July 25, 2020 | 904 | 26,757 |
| | | July 26 to December 31, 2020 | 872 | 27,658 |
| France | Auvergne-Rhône-Alpes (ARA), Bretagne (BRE), Île-de-France (IDF), Occitanie (OCC), Provence-Alpes-Côte d'Azur (PACA) | January 1 to July 25, 2020 | 416 | 2543 |
| | | July 26 to December 31, 2020 | 433 | 3124 |

The genomic diversity of circulating SARS-CoV-2 in the different continents, countries, and French regions was found to be similar (Figure 2—figure supplement 4). Overall, genomes showed high sequence conservation compared to the Wuhan-Hu-1 reference in 2020 (mean and median of ~13 single nucleotide polymorphisms (SNPs), with 95% of the distribution comprised between 4 and 25 SNPs).

Which continents exchanged SARS-CoV-2 with Europe and France?

Through 100 distinct, dated and ancestrally reconstructed phylogenetic trees, we first studied SARS-CoV-2 exchanges worldwide for each of the time periods studied. Between January 1 and July 25, 2020 (covering the first European epidemic wave), we found that Europe (excluding France) accounted for 57.3% of the total number of exportation events, and was the main source of SARS-CoV-2 exportations toward the other continents in all of the subsamples (Figure 3A–D and Figure 3—figure supplement 1). North America also contributed substantially to virus exportation during this period (24.3%). South America and Asia were each associated with 7.1% of the total number of exportation events, consistent with a later circulation of the virus in these continents (Figure 1—figure supplement 1).
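Shares of this kind are typically obtained by counting location changes along the branches of each ancestrally reconstructed tree. The sketch below illustrates that bookkeeping on a toy tree and is not the authors' code: the function names, the node labels, and the convention that a parent-to-child change of location counts as one exportation from the parent's location and one introduction into the child's location are assumptions made for illustration.

```python
from collections import Counter


def count_exchanges(parent_of, state_of):
    """Count exportation and introduction events on a tree with known node states.

    parent_of: {child node: parent node} for every non-root node
    state_of:  {node: location}, including reconstructed internal nodes
    """
    exports, imports = Counter(), Counter()
    for child, parent in parent_of.items():
        source, destination = state_of[parent], state_of[child]
        if source != destination:
            exports[source] += 1       # the parent's location exported the virus
            imports[destination] += 1  # the child's location received it
    return exports, imports


def shares(counts):
    """Convert event counts into percentages of the total."""
    total = sum(counts.values())
    return {location: 100.0 * n / total for location, n in counts.items()}


# Toy tree with hypothetical node names: root -> (n1, n2), n1 -> (t1, t2), n2 -> (t3, t4).
parent_of = {"n1": "root", "n2": "root", "t1": "n1", "t2": "n1", "t3": "n2", "t4": "n2"}
state_of = {
    "root": "Europe", "n1": "Europe", "n2": "North America",
    "t1": "Europe", "t2": "Asia", "t3": "North America", "t4": "France",
}

exports, imports = count_exchanges(parent_of, state_of)
export_share = shares(exports)
print(exports, imports, export_share)
```

With 100 subsampled trees, the same counting would be repeated per tree and the resulting counts or shares summarized across subsamples.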
France was estimated to have contributed of the total exportation events, that France was not the European source of SARS-CoV-2 at the international level between January 1 and July 25, 2020. The exportation events from France were mostly toward Europe to a to North America, South America, and Asia (Figure and Figure 3—figure supplement 2). These events mostly of the B.1 and lineages (Figure North America a large proportion of SARS-CoV-2 from other continents of the introduction by South America and Europe (Figure average of of all SARS-CoV-2 introductions were into France, and from North America and Europe (Figure 3—figure supplement 2). These introductions of the B.1 and lineages (Figure The first introductions into France were detected at the beginning of February, and increased to a before the lockdown from March 2020 (Figure Only South America and Asia were associated with a in SARS-CoV-2 introductions after this because such measures were there and the circulation of the virus remained limited in these regions.

Figure 3. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exchange worldwide. events were inferred with 100 subsampled phylogenies between January 1 and July 25, 2020, and between July 26 and December 31, 2020. (A) Number of introduction and exportation events for each subsample and for each and France. (B) SARS-CoV-2 exchange between continents and France during the two time periods investigated. In these of a location to the and with an at the is proportional to the exchange (C) Number of exportation and (D) introduction events per territory over time. The mean number of exchanges over the subsamples and for each week was the of the complete lockdowns in France. of lineages exported from France and introduced into France. with a proportion were into the

From July 26 to December 31, 2020 (second European epidemic wave), we observed exchange events worldwide compared to the first half of 2020.
we showed the of analyzing several as there was a large in the total number of exportation or introduction events, in Europe (Figure Europe was, as between January 1 and July 25, 2020, the main source of exchanges with a total of of the exportation events across by North America Asia and South America (Figure 3A–D and Figure 3—figure supplement 1). of the events occurred during the summer period to August 2020), corresponding to the summer in most countries of the world. France accounted for of the exportation events, but they were toward other European countries and overall detected from August to November 2020 (Figure and Figure 3—figure supplement consistent with the SARS-CoV-2 incidence in this period in France (Figure 1—figure supplement 2). The B.1.160 accounted for all the exportation events from France (Figure In a similar SARS-CoV-2 introductions into France mostly from Europe (Figure and were detected at a rate from April 2020, at a but limited rate from 2020, and at a strong level in September and October 2020 (Figure 3—figure supplement 2). These SARS-CoV-2 introductions into France in of B.1.160 B.1 and lineages (Figure the virus spread in We aimed to a more of SARS-CoV-2 exchanges between France and other European countries with the same Here, we only on European countries associated with a high incidence and without due to a of data on GISAID (Table 1). By the of introduction and exportation events between January 1 and July 25, 2020 across the we observed that was the to virus exportation toward other European countries, with an average of of the total number of exportation events. The United Kingdom, France, and also highly participated in virus and of and of the total number of exportation events, (Figure and Figure supplement 1). 
These are in line with epidemiological data, since was the first country in Europe to be affected by the and France, the United Kingdom, and were the other European countries associated with the number of SARS-CoV-2-related deaths during the first wave (Figure 1—figure supplement 1). The number of all exportation events however after the of lockdowns in the different countries (with the first one in on March (Figure France mostly exported SARS-CoV-2 toward and the United and a less toward and the All of these events occurred before the lockdown in France (Figure and Figure supplement and of the B.1 and lineages (Figure The rate of SARS-CoV-2 exportations from France until the second European epidemic wave, as it was also the case for other European countries except Russia (Figure For all introduction events, the were more the United accounted for a of the total number of events, while Russia, Belgium, Germany, Italy, Spain, France, and the represented between and of the total number of events (Figure In France, a high rate of introduction events was observed in and March before the lockdown and mostly from the United and (Figure supplement 2). These introductions in of the and B.1 lineages (Figure

Figure 4. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exchanges on the European events were calculated by the results from 100 subsampled phylogenies between January 1 and July 25, 2020, and between July 26 and December 31, 2020. (A) Number of introduction and exportation events for each subsample and for each European (B) SARS-CoV-2 exchange between European countries during the two time periods investigated. In these of a location to the and with an at is proportional to the exchange (C) Number of exportation and (D) introduction events per territory over time. The mean number of exchanges over subsamples and for each week was the of the complete lockdowns in France.
of lineages exported from France and introduced into France. with a proportion were into the The second time period (from July 26 to December 31, showed a different of exchanges. Here, we estimated exchanges compared to the first half of 2020. Russia accounted for most of the exportation events (Figure These events were estimated to during the the relaxation of measures in most European and the summer periods (Figure This result was expected since Russia was the European country to a high number of SARS-CoV-2-related deaths during this period (Figure 1—figure supplement 1). France and the United also highly participated in virus exportation (Figure and Figure supplement 1). of these events were detected between August and October 2020 (Figure and before the second lockdown in most European countries first one in on October 2020). these are consistent with epidemiological as was the first country in European to be associated with a of SARS-CoV-2-related deaths, rapidly by France. France mostly exported the virus toward and (Figure and Figure supplement and mostly the B.1.160 (Figure Focusing on introduction events, accounted for of the total number of by the United and France For the remaining European countries, the proportion of introduction events was comprised between and