Abstract

In real time, Twitter strongly imprints world events, popular culture, and the day-to-day, recording an ever-growing compendium of language change. Vitally, and absent from many standard corpora such as books and news archives, Twitter also encodes popularity and spreading through retweets. Here, we describe Storywrangler, an ongoing curation of over 100 billion tweets containing 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into 1-, 2-, and 3-grams across 100+ languages, generating frequencies for words, hashtags, handles, numerals, symbols, and emojis. We make the dataset available through an interactive time series viewer and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus. Illustrating the instrument's potential, we present example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest.
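
As a concrete, if simplified, illustration of the day-scale n-gram counting described above, the following Python sketch counts 2-grams over a single day's tweets. The whitespace tokenizer and the toy tweets are assumptions made for brevity; the project's actual parsing and language identification are considerably more involved.

```python
# Minimal sketch of day-level n-gram counting; illustrative only.
# The whitespace tokenizer below is an assumption, not the project's parser.
from collections import Counter


def tokenize(tweet: str) -> list[str]:
    # Naive tokenization: split on whitespace, keeping hashtags, handles,
    # numerals, and emoji intact when they are space-separated.
    return tweet.split()


def ngrams(tokens: list[str], n: int) -> list[str]:
    # Contiguous sequences of n tokens, joined by single spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_counts(tweets_for_one_day: list[str], n: int) -> Counter:
    # Aggregate n-gram counts over every tweet collected on a single day.
    counts = Counter()
    for tweet in tweets_for_one_day:
        counts.update(ngrams(tokenize(tweet), n))
    return counts


if __name__ == "__main__":
    day = ["the cat sat on the mat", "#WorldCup starts today", "the cat purred"]
    print(daily_counts(day, n=2).most_common(3))
```

Repeating such a count per day, per n, and per language yields the kind of frequency time series for words, hashtags, handles, and emojis that the viewer exposes.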

Highlights

  • Our collective memory lies in our recordings—in our written texts, artworks, photographs, audio, and video—and in our retellings and reinterpretations of that which becomes history

  • Using rank-turbulence divergence (RTD) [35], we examine the daily rate of usage of each n-gram, assessing the subset of n-grams that have become most inflated in relative usage (a simplified computational sketch follows this list)

  • Along with phrases associated with important events, Storywrangler encodes casual daily conversation in a format unavailable through newspaper articles and books

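To make the rank-turbulence divergence highlight concrete, the sketch below ranks two days' n-grams by frequency and computes per-n-gram contributions of the form |1/r1^alpha − 1/r2^alpha|^(1/(alpha+1)) [35]. The normalization, tie handling, and treatment of n-grams absent on one day are simplified assumptions here, not the reference implementation.

```python
# Sketch of (unnormalized) rank-turbulence divergence contributions between
# two daily n-gram frequency tables, following the general form in [35].
# Tie handling and the rank assigned to n-grams missing from one day are
# simplifications, not the reference implementation.
from collections import Counter


def ranks(counts: Counter) -> dict[str, int]:
    # Rank n-grams by descending count; rank 1 is the most frequent.
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {ngram: i + 1 for i, ngram in enumerate(ordered)}


def rtd_contributions(day1: Counter, day2: Counter, alpha: float = 1 / 3) -> dict[str, float]:
    # Per-n-gram terms |r1**-alpha - r2**-alpha| ** (1 / (alpha + 1)).
    # alpha is RTD's tunable parameter; 1/3 here is purely illustrative.
    r1, r2 = ranks(day1), ranks(day2)
    miss1, miss2 = len(r1) + 1, len(r2) + 1  # assumption: absent n-grams rank one past the list
    out = {}
    for ngram in set(r1) | set(r2):
        a, b = r1.get(ngram, miss1), r2.get(ngram, miss2)
        out[ngram] = abs(a ** -alpha - b ** -alpha) ** (1 / (alpha + 1))
    return out


day1 = Counter({"election": 50, "cat": 40, "coffee": 30})
day2 = Counter({"cat": 45, "coffee": 35, "storm": 20})
top = sorted(rtd_contributions(day1, day2).items(), key=lambda kv: kv[1], reverse=True)
print(top[:3])  # n-grams whose relative usage shifted most between the two days
```

Sorting the contributions in descending order surfaces the n-grams whose relative usage has inflated most between the two days.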

Introduction

Our collective memory lies in our recordings—in our written texts, artworks, photographs, audio, and video—and in our retellings and reinterpretations of that which becomes history. Large-scale constructions of historical corpora often fail to encode a fundamental characteristic: popularity (i.e., social amplification). For text-based corpora, we are confronted with the challenge of sorting through different aspects of popularity of n-grams—sequences of n “words” in a text that are formed by contiguous characters, numerals, symbols, emojis, etc. It is well established that n-gram frequency-of-usage (or Zipf) distributions are heavy-tailed [17]. This essential character of natural language is readily misinterpreted as indicating cultural popularity. The Google Books n-gram corpus [1], which, in part, provides inspiration for our work here, presents year-scale, n-gram frequency time series where each book, in principle, counts only once [2]. The words of George Orwell’s 1984 or Rick Riordan’s Percy

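As a toy demonstration of the heavy-tailed, Zipf-like rank-frequency behavior noted above, the following sketch tabulates 1-gram frequencies by rank and estimates the log-log slope; the sample text and the least-squares fit are illustrative assumptions only.

```python
# Toy rank-frequency (Zipf) tabulation for 1-grams, with a rough log-log
# slope estimate; purely illustrative, not the authors' corpus or analysis.
from collections import Counter
import numpy as np


def zipf_table(words: list[str]) -> list[tuple[int, int]]:
    # (rank, frequency) pairs, most frequent 1-gram first.
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


words = "the cat sat on the mat and the dog sat on the log".split()
table = zipf_table(words)
rank = np.array([r for r, _ in table], dtype=float)
freq = np.array([f for _, f in table], dtype=float)

# Least-squares slope in log-log space; an idealized Zipf law gives roughly -1.
slope, _ = np.polyfit(np.log(rank), np.log(freq), 1)
print(table[:5], round(float(slope), 2))
```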