Abstract

Objective: In recent years, we have increasingly observed issues concerning the quality of online information due to misinformation and disinformation. Aside from social media, there is growing awareness that questionnaire data collected using online recruitment methods may include suspect data provided by bots. Issues with data quality can be particularly problematic in health and biomedical contexts; thus, developing robust methods for identifying and removing suspect data is of paramount importance in informatics. In this study, we describe an interactive visual analytics approach to suspect data identification and removal, and demonstrate its application on questionnaire data pertaining to COVID-19 collected through different recruitment venues, including listservs and social media.

Methods: We developed a pipeline for data cleaning, pre-processing, analysis, and automated ranking of data to address data quality issues. We then employed the ranking in conjunction with manual review to identify suspect data and remove them from subsequent analyses. Lastly, we compared differences in the data before and after removal.

Results: We performed data cleaning, pre-processing, and exploratory analysis on a survey dataset (N = 4,163) collected using multiple recruitment mechanisms via the Qualtrics survey platform. Based on these results, we identified suspect features and used them to generate a suspect feature indicator for each survey response. We excluded survey responses that did not meet the inclusion criteria for the study (n = 29) and then performed manual review of the remaining responses, triangulating with the suspect feature indicator. Based on this review, we excluded 2,921 responses. Additional responses were excluded based on a spam classification by Qualtrics (n = 13) and the percentage of survey completion (n = 328), resulting in a final sample size of 872. We performed additional analyses to demonstrate the extent to which the suspect feature indicator was congruent with eventual inclusion, and compared the characteristics of the included and excluded data.

Conclusion: Our main contributions are: 1) a proposed framework for data quality assessment, including suspect data identification and removal; 2) an analysis of potential consequences in terms of representation bias in the dataset; and 3) recommendations for implementing this approach in practice.
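The abstract does not specify how the suspect feature indicator is computed; a minimal sketch, assuming it is a simple count of suspect-feature checks triggered per response (the specific checks below are illustrative placeholders, not the study's actual features):

```python
# Hypothetical sketch of a per-response suspect feature indicator,
# computed as the number of suspect-feature checks a response triggers.
# The check functions and field names are assumptions for illustration.

def suspect_indicator(response, checks):
    """Count how many suspect-feature checks a survey response triggers."""
    return sum(1 for check in checks if check(response))

# Illustrative suspect-feature checks (not from the paper):
checks = [
    lambda r: r.get("completion_seconds", float("inf")) < 60,  # implausibly fast completion
    lambda r: r.get("free_text", "").strip() == "",            # empty open-ended answer
    lambda r: r.get("duplicate_ip", False),                    # repeated IP address
]

responses = [
    {"completion_seconds": 45, "free_text": "", "duplicate_ip": True},
    {"completion_seconds": 600, "free_text": "Detailed answer", "duplicate_ip": False},
]

# Rank responses from most to least suspect for manual review triage.
scores = [suspect_indicator(r, checks) for r in responses]
print(scores)  # → [3, 0]
```

Ranking responses by such a score lets reviewers triage the most suspect records first, which matches the abstract's use of the ranking in conjunction with manual review.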
