Abstract

Background: Social media are important for monitoring perceptions of public health issues and for educating target audiences about health; however, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences and limits how well social media can be used for public health surveillance and education outreach efforts. Certain social media platforms provide demographic information on the followers of a user account when users supply it, but these details are not always disclosed. Researchers have developed machine learning algorithms to predict social media users’ demographic characteristics, mainly for Twitter. To date, there has been limited research on predicting the demographic characteristics of Reddit users.

Objective: We aimed to develop a machine learning algorithm that predicts the age segment of Reddit users, as either adolescents or adults, based on publicly available data.

Methods: This study was conducted between January and September 2020 using publicly available Reddit posts as input data. We manually labeled Reddit users’ age by identifying and reviewing public posts in which Reddit users self-reported their age. We then collected sample posts, comments, and metadata for the labeled user accounts and created variables to capture linguistic patterns, posting behavior, and account details that would distinguish the adolescent age group (aged 13 to 20 years) from the adult age group (aged 21 to 54 years). We split the data into training (n=1660) and test (n=415) sets and performed 5-fold cross-validation on the training set to select hyperparameters and perform feature selection. We ran multiple classification algorithms and tested the performance of the models (precision, recall, F1 score) in predicting the age segments of the users in the labeled data. To evaluate associations between each feature and the outcome, we calculated means and confidence intervals and compared the two age groups using 2-sample t tests for each transformed model feature.

Results: The gradient boosted trees classifier performed the best, with an F1 score of 0.78. The test set precision and recall scores were 0.79 and 0.89, respectively, for the adolescent group (n=254) and 0.78 and 0.63, respectively, for the adult group (n=161). The most important feature in the model was the number of sentences per comment (permutation score: mean 0.100, SD 0.004). Compared with members of the adult group, members of the adolescent group tended to have created their accounts more recently, to have higher proportions of submissions and comments in the r/teenagers subreddit, and to post more in subreddits with higher subscriber counts.

Conclusions: We created a Reddit age prediction algorithm with competitive accuracy using publicly available data, suggesting that machine learning methods can help public health agencies identify age-related target audiences on Reddit. Our results also suggest that there are characteristics of Reddit users’ posting behavior, linguistic patterns, and account features that distinguish adolescents from adults.
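The abstract reports per-group precision and recall alongside a single overall F1 score of 0.78, but it does not say how the overall score was aggregated. The following is a minimal arithmetic sketch, not the authors' evaluation code: it recomputes per-group F1 from the rounded figures above and compares two common aggregations. The support-weighted average is an assumption made here for illustration, and the function and variable names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Test-set figures as reported in the abstract (already rounded to 2 decimals).
reported = {
    "adolescent": {"precision": 0.79, "recall": 0.89, "n": 254},
    "adult": {"precision": 0.78, "recall": 0.63, "n": 161},
}

# Per-group F1 from the reported precision and recall.
per_class_f1 = {
    group: f1_score(vals["precision"], vals["recall"])
    for group, vals in reported.items()
}

total_n = sum(vals["n"] for vals in reported.values())

# Assumption: overall F1 as a support-weighted average of the per-group scores.
weighted_f1 = sum(
    per_class_f1[group] * reported[group]["n"] for group in reported
) / total_n

# Alternative: unweighted (macro) average of the per-group scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print({group: round(score, 3) for group, score in per_class_f1.items()})
print("weighted:", round(weighted_f1, 2), "macro:", round(macro_f1, 2))
```

Under these assumptions the support-weighted average rounds to 0.78, matching the reported overall score, while the macro average rounds to 0.77. Because the inputs are themselves rounded, this is a consistency check rather than proof of which aggregation was used.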

Highlights

  • Public health campaigns are a primary means for government agencies and nongovernmental organizations to raise awareness about important health issues affecting their communities

  • With increased media consumption and interpersonal interactions occurring online, social media platforms have become important in both engaging target audiences in public education campaigns and in understanding behaviors and perceptions around emerging public health issues across these target audiences

  • The fields that we examined for predicting age of Reddit users included summary statistics of metadata fields from the application programming interface (API) and other derived variables that may help distinguish adolescent from older age groups
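As a rough illustration of the last highlight, the sketch below derives three per-user summary features of the kind the abstract mentions: sentences per comment, share of comments in r/teenagers, and account age. It is a sketch under stated assumptions, not the study's feature pipeline. The dictionary keys (`body`, `subreddit`), the function name `derive_user_features`, and the regex-based sentence splitter are all assumptions made here.

```python
import re
from statistics import mean

SECONDS_PER_DAY = 86400


def count_sentences(text):
    """Naive sentence count: split on runs of '.', '!' or '?' (assumed heuristic)."""
    return len([part for part in re.split(r"[.!?]+", text) if part.strip()])


def derive_user_features(comments, account_created_utc, now_utc):
    """Summarize one user's comments into numeric features.

    comments: list of dicts with 'body' and 'subreddit' keys (assumed layout).
    account_created_utc, now_utc: Unix timestamps in seconds.
    """
    if not comments:
        raise ValueError("at least one comment is required")
    sentence_counts = [count_sentences(c["body"]) for c in comments]
    in_teenagers = [c["subreddit"].lower() == "teenagers" for c in comments]
    return {
        "mean_sentences_per_comment": mean(sentence_counts),
        "prop_comments_r_teenagers": sum(in_teenagers) / len(comments),
        "account_age_days": (now_utc - account_created_utc) / SECONDS_PER_DAY,
    }


# Made-up example data for a single hypothetical user.
example_comments = [
    {"body": "I have a test tomorrow. Wish me luck!", "subreddit": "teenagers"},
    {"body": "Same here.", "subreddit": "AskReddit"},
]
features = derive_user_features(
    example_comments,
    account_created_utc=1_577_836_800,
    now_utc=1_577_836_800 + 10 * SECONDS_PER_DAY,
)
print(features)
```

A real pipeline would likely use a proper sentence tokenizer and many more fields; the point here is only that each user's raw posts and metadata collapse into one row of numeric features that a classifier can consume.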


Summary

Introduction

Public health campaigns are a primary means for government agencies and nongovernmental organizations to raise awareness about important health issues affecting their communities. With increased media consumption and interpersonal interactions occurring online, social media platforms have become important in both engaging target audiences in public education campaigns and in understanding behaviors and perceptions around emerging public health issues across these target audiences. In tobacco prevention and control, being able to segment social media posts by audience age group would help researchers and public health agencies identify emerging issues and changes in behaviors and attitudes, facilitating public health surveillance and educational outreach to these at-risk populations. This would allow researchers to observe their target audience naturalistically, including how its members interact with discussions of tobacco products. There has been limited research on predicting the demographic characteristics of Reddit users.

