Generating Political Event Data in Near Real Time: Opportunities and Challenges

John Beieler, Andrew Halterman, Erin M Simpson, Patrick T Brandt, Philip A Schrodt

doi:10.1017/cbo9781316257340.005

Abstract

Political event data are records of interactions among political actors using common codes for actors and actions, allowing for the aggregate analysis of political behaviors. These data include both material interactions between political entities and verbal statements. Such data are common in international relations, recording the spoken or direct actions between nation-states and other political entities. Event data can be generated through either human-coded or machine-based methods. Human-coded event data efforts continue to dominate research on global protests and social movements, although data sets in international relations have led the movement toward automated coding. While humans are better able to extract the meaning in sentences using background knowledge and innate abilities for dealing with complex grammatical constructions, human coding is dramatically more labor- and time-intensive than machine-coding approaches for anything but small or one-off data sets. Machine-coded methods can attain 70–80% accuracy when compared to a human-coded “gold standard,” which is comparable to, and in some cases exceeds, the intercoder reliability of human coding (King and Lowe, 2004). This makes the machine-coded methods quite scalable in terms of costs and time and thus attractive to academic, government, and private sector researchers. King (2011) notes that the ability to code and process political texts to generate records like event data will be de rigueur in the later part of the 21st century. Machine-readable text about politics, including news reports, speeches, press conferences, and intelligence reports, is already the basis of many political analyses.
The ever-increasing availability of such texts presents both opportunities and challenges because they are a form of “big data.” Even processing just the lead sentences of Reuters and Agence France-Presse (AFP) news reports for the Levant from 1979 to 2011 generates more than 140,000 distinct time-series records (http://eventdata.parusanalytics.com/data.dir/levant.html), and these sentences could also be processed as a much larger set of network relationships. One recent effort to expand event data collection beyond this geographical region, albeit without the event de-duplication found in most event data sets, has generated nearly a quarter of a billion records. Extrapolating from our coding experience with the Levant and our initial experiments with the EL:DIABLO coding system described later, we estimate that a data collection with duplication controls like those used for the Levant data set will generate around 4,000 to 8,000 distinct records per day for the entire globe.
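
To make the notions of an event record and de-duplication concrete, the following is a minimal sketch in Python. It is not the EL:DIABLO implementation or the procedure used for the Levant data set. The `Event` schema, the actor codes, the action labels, and the rule that two records are duplicates when date, source, target, and action all match are assumptions made here for illustration only.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """One coded event: who did what to whom, and when (hypothetical schema)."""
    date: str    # ISO date of the report, e.g. "2011-03-15"
    source: str  # actor initiating the action (illustrative code)
    target: str  # actor receiving the action (illustrative code)
    action: str  # event-type label (illustrative, not a real coding ontology)


def deduplicate(events):
    """Keep the first occurrence of each (date, source, target, action) tuple.

    Assumption: records identical on all four fields describe the same event.
    """
    seen = set()
    unique = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique


def daily_counts(events):
    """Count records per day."""
    return Counter(event.date for event in events)


# Made-up records; the second is the same event reported by a second wire service.
raw = [
    Event("2011-03-15", "ISR", "PSE", "MEET"),
    Event("2011-03-15", "ISR", "PSE", "MEET"),
    Event("2011-03-15", "SYR", "LBN", "CRITICIZE"),
    Event("2011-03-16", "ISR", "PSE", "MEET"),
]
unique = deduplicate(raw)
counts = daily_counts(unique)
```

Under these assumptions, the gap between the size of `raw` and the size of `unique` is what separates a collection without de-duplication from one with duplication controls, and `counts` is the kind of per-day tally behind an estimate of distinct records per day.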
