Physical Activity, Sedentary Behavior, and Sleep on Twitter: Multicountry and Fully Labeled Public Data Set for Digital Public Health Surveillance Research.

Zahra Shakeri Hossein Abad, Joon Lee, Gregory P Butler, Wendy Thompson

doi:10.2196/32355

Abstract

Background: Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the long-lasting social media applications. However, existing PHS systems that feed on social media data have not been widely deployed in national surveillance systems, which appears to stem from practitioners' and the public's lack of trust in social media data. Providing more robust and reliable data sets on which supervised ML models can be trained and tested is a significant step toward overcoming this hurdle. The health implications of daily behaviors (physical activity, sedentary behavior, and sleep [PASS]), an evergreen topic in PHS, are widely studied through traditional data sources such as surveillance surveys and administrative databases. These sources are often several months out of date by the time they are used and are costly to collect, and thus are limited in quantity and coverage.

Objective: The main objective of this study is to present a large-scale, multicountry, longitudinal, and fully labeled data set to enable and support digital PASS surveillance research in PHS. To support high-quality surveillance research using our data set, we conducted further analysis to supplement it with additional PHS-related metadata.

Methods: We collected the data for this study from Twitter using the Twitter livestream application programming interface between November 28, 2018, and June 19, 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies, and linguistic analysis. We used Amazon Mechanical Turk to label the collected data into self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crowd-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate the different components of the data set.

Results: LPHEADA (Labelled Digital Public Health Dataset) contains 366,405 crowd-generated labels (3 labels per tweet) for 122,135 PASS-related tweets that originated in Australia, Canada, the United Kingdom, or the United States, labeled by 708 unique annotators on Amazon Mechanical Turk. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (ie, gender and age range) associated with each tweet.

Conclusions: Publicly available data sets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the data set provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.
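The abstract reports 3 crowd-generated labels per tweet and mentions label inference analysis without naming a specific technique. As a minimal sketch only, the following Python example shows one common way to infer a single label from three crowd labels: strict majority voting. This is an assumption for illustration and is not confirmed as the authors' method. The tweet identifiers and label names (`physical_activity`, `sedentary`, `sleep`) are hypothetical and do not come from the data set. The last lines check the arithmetic stated in the Results (122,135 tweets with 3 labels each).

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(labels: List[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require more than half of the votes; ties and 1-1-1 splits yield None.
    return label if count > len(labels) / 2 else None


def aggregate(crowd_labels: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Infer one label per tweet from its list of crowd-generated labels."""
    return {tweet_id: majority_label(votes) for tweet_id, votes in crowd_labels.items()}


# Hypothetical tweet ids and label names, for illustration only.
crowd_labels = {
    "t1": ["physical_activity", "physical_activity", "sleep"],
    "t2": ["sleep", "sedentary", "physical_activity"],
    "t3": ["sedentary", "sedentary", "sedentary"],
}
inferred = aggregate(crowd_labels)

# Consistency check on the counts reported in the abstract.
TWEETS = 122_135
LABELS_PER_TWEET = 3
TOTAL_LABELS = TWEETS * LABELS_PER_TWEET
```

Tweets without a strict majority (such as `t2` here) are returned as `None`, so they can be routed to further review rather than being assigned an arbitrary label.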

Journal: JMIR Public Health and Surveillance
Publication Date: Feb 14, 2022
Citations: 1
License type: cc-by

Similar Papers

Crowdsourcing for Machine Learning in Public Health Surveillance: Lessons Learned From Amazon Mechanical Turk.
Zahra Shakeri Hossein Abad ... Joon Lee
Journal of Medical Internet Research | VOL. 24 | 18 Jan 2022

Concepts, objectives and analysis of public health surveillance systems
Hurmat Ali Shah ... Mowafa Househ
Computer Methods and Programs in Biomedicine Update | VOL. 5 | 01 Jan 2024

Calibrating Wrist-Worn Accelerometers for Physical Activity Assessment in Preschoolers: Machine Learning Approaches.
Shiyu Li ... Deborah Parra-Medina
JMIR Formative Research | VOL. 4 | 31 Aug 2020

Public health surveillance in the U.S. Department of Veterans Affairs: evaluation of the Praedico surveillance system
Cynthia Lucero-Obusan ... Anoshiravan Mostaghimi
BMC Public Health | VOL. 22 | 10 Feb 2022
