Abstract

We provide a new text corpus from the social medium Telegram, which is rich in indirect forms of divisive speech. We scraped all messages from one channel of Donald Trump supporters, covering a large part of his presidency, from late 2016 until January 2021, including the January 6 Capitol riot. The discussion among the group members, over this long time period, includes the spread of disinformation, disparaging of out-group members, and other forms of harmful speech. To enable research into the role of harmful speech in political discourse, we added two types of annotations to the corpus: (i) automatic annotations of offensive language for all messages, and (ii) our own manual annotations of harmful language for a portion of the posts leading up to the January 2021 Capitol riot and its aftermath.

Highlights

  • While many similar channels introduced a policy of purging their chat history daily, this channel essentially preserved its integrity from the day it was created on December 11, 2016. It represents a unique testimony of a controversial period of American history, by providing a rich source of harmful speech and practices.

  • The content and metadata were mined using the Telethon¹ Python package. This is an interface to the Telegram API which facilitates interaction with Telegram and application development.

  • Our data suggests there is utility in evaluating these methods based on novel data, such as our corpus. This is because the in-group community that tends to populate Telegram channels has a different dynamic than that of the more open and heterogeneous communities present on Facebook, Twitter, or YouTube.
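As a rough illustration of how channel content and metadata can be mined with Telethon, the sketch below flattens each message into a plain record and iterates over a channel's history from oldest to newest. It is not the authors' collection script: the function names, the record field names, the session name, and the credentials passed in are all assumptions made for this example. The Telethon import is deferred so that the record helper can be tried without the package installed.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def message_to_record(msg):
    """Flatten a Telethon-style message object into a plain dict.

    Field names in the dict are hypothetical; the attributes read from
    `msg` (id, date, sender_id, reply_to_msg_id, media, text) are the
    ones Telethon exposes on its Message objects.
    """
    return {
        "message_id": msg.id,
        "date": msg.date.isoformat() if msg.date else None,
        "user_id": msg.sender_id,
        "reply_to_id": msg.reply_to_msg_id,
        "has_media": msg.media is not None,
        "text": msg.text or "",
    }


async def collect_channel(channel, api_id, api_hash, session="corpus_session"):
    """Collect all messages of one channel (assumed workflow, needs network)."""
    # Imported lazily so message_to_record works without Telethon installed.
    from telethon import TelegramClient

    records = []
    async with TelegramClient(session, api_id, api_hash) as client:
        # reverse=True walks the history from the oldest message forward.
        async for msg in client.iter_messages(channel, reverse=True):
            records.append(message_to_record(msg))
    return records


# Offline demonstration with a stand-in object instead of a real message.
fake_msg = SimpleNamespace(
    id=7,
    date=datetime(2021, 1, 6, 18, 30, tzinfo=timezone.utc),
    sender_id=42,
    reply_to_msg_id=None,
    media=None,
    text=None,  # deleted messages can come back without text
)
record = message_to_record(fake_msg)
print(record)
```

Running the real collection would require Telegram API credentials and something like `asyncio.run(collect_channel(...))` with a channel name of your own; none of those values come from the paper.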


Summary

METHOD

The data collection covers one public channel from the platform Telegram, encompassing four years of Donald Trump’s presidency, through the prism of his supporters’ conversations, leading up to and including discussion of the January 6 Capitol riot. While many similar channels introduced a policy of purging their chat history daily, this channel essentially preserved its integrity from the day it was created on December 11, 2016. It represents a unique testimony of a controversial period of American history, by providing a rich source of harmful speech and practices. Our data contains the metadata, including date and time of post creation, message ID, user ID, the ID of the message replied to, and any attached media (e.g., image, video, sticker), as well as the message text itself. This may be useful for further research modelling the interactions among participants in the community. As a result of the controversial nature of the data, 3,619 additional messages originally posted in the channel appear to have been deleted prior to collection, leaving blank message content, which we filtered out. This reduced the initial 1,068 unique users to 521.

1 https://docs.telethon.dev
2 See (Scheer et al. 2021) and the annotation guidelines for examples and specifications.
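The filtering step described above (dropping messages with blank content and recounting the remaining unique users) can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the function names, and the sample data are invented, and the exact criterion the authors used for "blank" is not stated in the text, so whitespace-only text is treated as blank here by assumption.

```python
def drop_blank_messages(records):
    """Keep only records whose text is non-empty after stripping whitespace."""
    return [r for r in records if (r.get("text") or "").strip()]


def unique_users(records):
    """Return the set of user IDs that appear in the given records."""
    return {r["user_id"] for r in records if r.get("user_id") is not None}


# Invented sample records; field names are hypothetical.
sample = [
    {"message_id": 1, "user_id": 10, "text": "first post"},
    {"message_id": 2, "user_id": 11, "text": ""},    # blank content
    {"message_id": 3, "user_id": 10, "text": "a reply"},
    {"message_id": 4, "user_id": 12, "text": None},  # blank content
]

kept = drop_blank_messages(sample)
removed_count = len(sample) - len(kept)
users_before = len(unique_users(sample))
users_after = len(unique_users(kept))

print(f"{removed_count} blank messages removed")
print(f"unique users: {users_before} -> {users_after}")
```

In this toy sample, users who only ever posted messages that are now blank disappear from the user count, which is the same mechanism by which the corpus went from 1,068 unique users to 521.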

DATASET DESCRIPTION
REUSE POTENTIAL
Findings
FUNDING STATEMENT
