Using Machine Learning–Based Approaches for the Detection and Classification of Human Papillomavirus Vaccine Misinformation: Infodemiology Study of Reddit Discussions

Jingcheng Du, Sharice Preston, Hanxiao Sun, Muhammad Amith, Lara Savas, Julie Boom, Cui Tao, Rachel Cunningham, Ross Shegog

doi:10.2196/26478

Abstract

Background: The rapid growth of social media as an information channel has made it possible to quickly spread inaccurate or false vaccine information, thus creating obstacles for vaccine promotion.

Objective: The aim of this study is to develop and evaluate an intelligent automated protocol for identifying and classifying human papillomavirus (HPV) vaccine misinformation on social media using machine learning (ML)–based methods.

Methods: Reddit posts (from 2007 to 2017, N=28,121) that contained keywords related to HPV vaccination were compiled. A random subset (2200/28,121, 7.82%) was manually labeled for misinformation and served as the gold standard corpus for evaluation. A total of 5 ML-based algorithms, including a support vector machine, logistic regression, extremely randomized trees, a convolutional neural network, and a recurrent neural network designed to identify vaccine misinformation, were evaluated for identification performance. Topic modeling was applied to identify the major categories associated with HPV vaccine misinformation.

Results: A convolutional neural network model achieved the highest area under the receiver operating characteristic curve of 0.7943. Of the 28,121 Reddit posts, 7207 (25.63%) were classified as vaccine misinformation, with discussions about general safety issues identified as the leading type of misinformed posts (2666/7207, 36.99%).

Conclusions: ML-based approaches are effective in the identification and classification of HPV vaccine misinformation on Reddit and may be generalizable to other social media platforms. ML-based methods may provide the capacity and utility to meet the challenge involved in intelligent automated monitoring and classification of public health misinformation on social media platforms. The timely identification of vaccine misinformation on the internet is the first step in misinformation correction and vaccine promotion.

Highlights

Human papillomavirus (HPV) infection is a highly prevalent sexually transmitted infection.
We report the utility of various conventional machine learning (ML) and deep learning (DL) algorithms to automatically identify and categorize misinformation on the HPV vaccine using posts on Reddit, a popular social media platform with more than 330 million monthly active users [24].
Our approach can be divided into two steps: (1) evaluation of ML algorithms for vaccine misinformation identification and (2) topic modeling on Reddit posts that contain vaccine misinformation (ML-inferred).
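Step 1 ranks the candidate classifiers by the area under the receiver operating characteristic curve (AUC). As a minimal sketch of that comparison, the AUC can be computed as the probability that a randomly chosen misinformation post receives a higher score than a randomly chosen non-misinformation post, with ties counting as one half. The function name `roc_auc`, the model names, and all labels and scores below are invented for illustration; they are not the study's code or data.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) formulation of the area under the ROC curve.

    labels: 1 for a post labeled as misinformation, 0 otherwise.
    scores: classifier scores, higher meaning "more likely misinformation".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive post ranked above negative post
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical gold-standard labels and scores from two hypothetical models.
gold_labels = [1, 0, 1, 0, 1, 0]
candidate_scores: Dict[str, Sequence[float]] = {
    "model_a": [0.9, 0.2, 0.8, 0.4, 0.3, 0.1],
    "model_b": [0.6, 0.5, 0.4, 0.7, 0.8, 0.3],
}

auc_by_model = {
    name: roc_auc(gold_labels, scores)
    for name, scores in candidate_scores.items()
}
best_model = max(auc_by_model, key=auc_by_model.get)

for name, auc in auc_by_model.items():
    print(f"{name}: AUC = {auc:.4f}")
print("selected:", best_model)
```

The nested loop is quadratic in the number of labeled posts, which is acceptable for a gold standard of a few thousand posts; a rank-based implementation would be the usual choice for larger sets.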

Summary

Introduction

Background: Human papillomavirus (HPV) infection is a highly prevalent sexually transmitted infection. The rapid growth of social media as an information channel has made it possible to quickly spread inaccurate or false vaccine information and has given antivaccine campaigns a platform to promulgate vaccine-related misinformation [9], creating obstacles for vaccine promotion. Objective: The aim of this study is to develop and evaluate an intelligent automated protocol for identifying and classifying HPV vaccine misinformation on social media using machine learning (ML)–based methods. Conclusions: ML-based approaches are effective in the identification and classification of HPV vaccine misinformation on Reddit and may be generalizable to other social media platforms. The timely identification of vaccine misinformation on the internet is the first step in misinformation correction and vaccine promotion.

Methods

Results

Discussion

Conclusion