Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set.

Ari Z Klein, Jesus Ivan Flores Amaro, Karen O'Connor, Graciela Gonzalez Hernandez, Davy Weissenbacher, Arjun Magge

doi:10.2196/25314

Open Access

https://doi.org/10.2196/25314

Journal: Journal of Medical Internet Research	Publication Date: Jan 22, 2021
Citations: 46	License type: cc-by

Abstract

Background: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

Objective: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention.

Methods: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020.

Results: Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations.

Conclusions: We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
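
The rule-based stages described in the abstract's Methods (regular-expression matching followed by removal of reported speech) can be sketched roughly as below. This is a minimal illustration only: the patterns in `EXPOSURE_PATTERNS` and the heuristics in `REPORTED_SPEECH` are hypothetical stand-ins, not the study's handwritten regular expressions or its actual filter, and the function names are assumptions introduced for this example.

```python
import re

# Hypothetical patterns for illustration; the study's handwritten
# regular expressions are not reproduced here.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi (?:think i|might|may|probably) have "
        r"(?:covid(?:-19)?|coronavirus|corona)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bmy (?:roommate|coworker|husband|wife) has "
        r"(?:covid(?:-19)?|coronavirus|corona)\b",
        re.IGNORECASE,
    ),
]

# Assumed heuristics for "reported speech": a leading quotation mark,
# a retweet prefix, or a link (treated here as a sign of a news headline).
REPORTED_SPEECH = re.compile(r'^\s*["“]|^RT @|https?://', re.IGNORECASE)


def matches_exposure(text: str) -> bool:
    """Return True if any exposure pattern is found in the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """Return True if the tweet looks like a quotation or headline."""
    return bool(REPORTED_SPEECH.search(text))


def candidate_tweets(tweets):
    """Keep tweets that match a pattern and are not reported speech."""
    return [
        t for t in tweets
        if matches_exposure(t) and not is_reported_speech(t)
    ]


if __name__ == "__main__":
    sample = [
        "I think I have covid, lost my sense of taste",
        '"I might have coronavirus," says senator https://example.com',
        "Stay home and wash your hands",
    ]
    for tweet in candidate_tweets(sample):
        print(tweet)
```

In a pipeline like the one described, tweets that survive this stage would then be passed to the trained classifier; the rules only narrow the candidate set.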

Highlights

In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results have presented challenges for actively monitoring the spread of COVID-19 based on testing alone.
Since none of the tweets in Textbox 1 report being tested for or diagnosed with COVID-19, they represent potential cases that may not have been reported to the Centers for Disease Control and Prevention (CDC).
The classifier based on the bidirectional encoder representations from transformers (BERT)-Base-Uncased pretrained model achieved an F1-score of 0.70 for the “potential case” class, and the classifier based on the COVID-Twitter-BERT pretrained model achieved an F1-score of 0.76.
We deployed our automatic pipeline, using the COVID-Twitter-BERT classifier, on more than 85 million unlabeled tweets that were continuously collected from the Twitter Streaming application programming interface (API) between March 1 and August 21, 2020.
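
The F1-score is the harmonic mean of precision and recall, so the reported precision of 0.76 and recall of 0.76 give an F1-score of 0.76. The short sketch below shows the arithmetic; the confusion counts passed to `precision_recall` are made-up numbers chosen only to reproduce those ratios, not figures from the study.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Compute precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for the COVID-Twitter-BERT classifier.
    print(round(f1_score(0.76, 0.76), 2))  # 0.76

    # Hypothetical counts, for illustration only.
    p, r = precision_recall(tp=76, fp=24, fn=24)
    print(round(f1_score(p, r), 2))  # 0.76
```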

Summary

Introduction

In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results have presented challenges for actively monitoring the spread of COVID-19 based on testing alone. We present a publicly available data set containing 13,714 tweets that were identified by our automatic NLP pipeline between March 1 and August 21, 2020, with each tweet’s time stamp and US state–level geolocation. This data set presents the opportunity to explore the use of Twitter data as a complementary resource “to understand and model the transmission and trajectory of COVID-19” [10].
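
Because each tweet in the data set carries a time stamp and a US state–level geolocation, one simple way to start exploring it is to count self-reports per state and week. The sketch below assumes, purely for illustration, that each record is an `(ISO 8601 time stamp, state abbreviation)` pair; the released file's actual layout may differ, and the sample records are invented.

```python
from collections import Counter
from datetime import datetime


def weekly_counts_by_state(records):
    """Count records per (state, ISO year, ISO week).

    `records` is an iterable of (timestamp, state) pairs, where the
    timestamp is an ISO 8601 string. This layout is an assumption.
    """
    counts = Counter()
    for timestamp, state in records:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        counts[(state, iso[0], iso[1])] += 1
    return counts


if __name__ == "__main__":
    # Invented sample records, not taken from the data set.
    sample = [
        ("2020-03-02T14:05:00", "NY"),
        ("2020-03-04T09:30:00", "NY"),
        ("2020-03-09T18:00:00", "CA"),
    ]
    for key, n in sorted(weekly_counts_by_state(sample).items()):
        print(key, n)
```

Counts like these are only a starting point: they would need to be compared against reported case data before drawing any conclusion about the spread of COVID-19.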

Methods

Results

Discussion

Conclusion