Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines.

Karen O'Connor, Jeanmarie Perrone, Abeed Sarker, Graciela Gonzalez Hernandez

doi:10.2196/15861


Open Access

https://doi.org/10.2196/15861


Abstract

Background: Social media data are increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter.

Objective: This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse–related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described.

Methods: We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes: abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data.

Results: Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement. The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271).

Conclusions: Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.
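
The two summary statistics quoted in the abstract, Cohen kappa for interannotator agreement and a 95% CI around classification accuracy, can be illustrated with a minimal Python sketch. The function names and the toy label lists below are hypothetical and are not taken from the corpus. The abstract does not state how the CI was computed, so the normal (Wald) approximation used here is an assumption; it gives bounds close to, but not necessarily identical to, the reported interval.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


def accuracy_ci(accuracy, n, z=1.96):
    """Normal-approximation CI for a proportion (assumed method)."""
    half_width = z * sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# Hypothetical toy annotations using the 4 broad classes from the Methods.
annotator_1 = ["abuse", "abuse", "consumption", "mention",
               "unrelated", "mention", "consumption", "abuse"]
annotator_2 = ["abuse", "consumption", "consumption", "mention",
               "unrelated", "mention", "mention", "abuse"]

kappa = cohen_kappa(annotator_1, annotator_2)

# Accuracy and test-set size as reported in the abstract.
ci_low, ci_high = accuracy_ci(0.73, 3271)

print(f"toy kappa = {kappa:.3f}")
print(f"95% CI for accuracy = {ci_low:.3f} to {ci_high:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when class frequencies are imbalanced, as they typically are in annotation tasks of this kind.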

Highlights

Social media has provided a platform for internet users to share experiences and opinions, and the abundance of data available has turned social networking websites into valuable resources for research.
We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve interannotator agreement (IAA) for the manual annotation task.
We present here an analysis of how prescription medication abuse information is presented on Twitter, the details of a large-scale annotation process that we have conducted, annotation guidelines that may be used for future annotation efforts, and a large annotated dataset involving various abuse-prone medications that we envision will drive community-driven data science and natural language processing (NLP) research on the topic.

Summary

Introduction

Background

Social media has provided a platform for internet users to share experiences and opinions, and the abundance of data available has turned social networking websites into valuable resources for research. Social media chatter encapsulates knowledge regarding diverse topics such as politics [1], sports [2], and health [3]. Users seek and share health-related information on social media regularly, resulting in the continuous generation of knowledge regarding health conditions, drugs, interventions, and health care policies. Social media has become an important source of data for public health monitoring because the data generated can be collected and processed in near real-time to make population-level estimates. Social media data are increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. A number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse.

Methods

Results

Discussion

Conclusion