Abstract

Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes—the leading cause of infant mortality—could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms—feature-engineered and deep learning-based classifiers—that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
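To make the class-imbalance handling and the per-class F1-scores above concrete, the following is a minimal sketch in plain Python. It is not the study's implementation: the function names `undersample` and `f1_per_class`, the `ratio` parameter, and the label strings are assumptions made for illustration, and the study's actual sampling scheme and classifier features are not specified here.

```python
import random

def undersample(examples, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority-class examples until at most
    ratio * (number of all other examples) remain.

    `examples` is a list of (text, label) pairs. The names and the
    `ratio` parameter are hypothetical, not taken from the study.
    """
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex[1] == majority_label]
    minority = [ex for ex in examples if ex[1] != majority_label]
    keep = min(len(majority), int(ratio * len(minority)))
    sampled = rng.sample(majority, keep) + minority
    rng.shuffle(sampled)
    return sampled

def f1_per_class(gold, predicted):
    """Return {label: F1} from paired gold and predicted labels."""
    scores = {}
    for label in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores
```

Under-sampling would be applied to the training portion only, so that the evaluation data keep their natural distribution; reporting F1 per class rather than overall accuracy keeps the majority "non-defect" class from masking performance on the two positive classes.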

Highlights

  • Despite the fact that birth defects are the leading cause of infant mortality in the United States [1], methods for observing pregnancies with birth defect outcomes remain limited

  • Our pipeline [11] begins by collecting all the publicly available tweets of women who have announced their pregnancy on Twitter, which enables the use of social media for selecting internal comparator groups and provides a unique opportunity to explore unknown risk factors among the chatter

  • The classifiers were evaluated on a held-out test set, a random sample of 20% of the annotated corpus (4602 tweets), stratified based on the natural, imbalanced distribution of “defect,” “possible defect,” and “non-defect” tweets that would be automatically detected by the lexicon-based retrieval [10] in practice
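A stratified held-out split of the kind described in the last highlight can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function name `stratified_split`, the seed, and the per-class rounding rule are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split example indices into train and test sets so that each
    class contributes roughly `test_fraction` of its examples to the
    test set, preserving the natural class distribution.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Per-class rounding is an assumption of this sketch.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Sampling within each class, rather than over the whole corpus at once, ensures that the rare "defect" and "possible defect" classes are represented in the test set in the same proportion as in the full annotated corpus.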


Summary

Introduction

Despite the fact that birth defects are the leading cause of infant mortality in the United States, methods for observing pregnancies with birth defect outcomes remain limited (e.g., clinical trials, animal studies, pregnancy exposure registries). We identified a small cohort in a database containing the timelines (the publicly available tweets posted by a user over time) of more than 100,000 users automatically identified via their public announcements of pregnancy on Twitter [11]. We used their timelines to conduct an observational case-control study, in which we compared select risk factors among the women reporting a birth defect outcome (cases) and users for whom we did not detect a birth defect outcome, selected from the same database (controls). Because our pipeline continues to collect the tweets that users post after pregnancy, social media provides a means of long-term follow-up after birth.

