Abstract

We describe a gold standard corpus of protest events that comprises various local and international English-language sources from various countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, so that knowledge bases enabling comparative social and political science studies can be constructed. For each news source, the annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.

Introduction

Socio-political event knowledge bases enable comparative social and political studies. Since news media provide a continuous flow of data over time and enable researchers to determine the significance of the events that are reported, social and political scientists turn to news data to create knowledge bases of protest events [3, 4, 5]. As members of the Emerging Welfare (EMW) project, we took on the challenge of creating a common foundation, consisting of the required high-quality data and state-of-the-art tools, for fully automating the creation of reliable and valid protest knowledge bases. This foundation would serve as a benchmark from which protest event collection studies can benefit. This effort has yielded a gold standard corpus (GSC) that will enable the machine learning (ML) and computational linguistics communities to study the development of text processing tools for constructing knowledge bases of protest events.
