Efficient detection of data entry errors in large-scale public health surveys: an unsupervised machine learning approach

Arkaprabha Sau, Santanu Phadikar, Ishita Bhakta

doi:10.1186/s12982-024-00245-3

Abstract

Data entry errors in large-scale public health surveys can undermine the effectiveness of data-driven interventions, so identifying them is crucial for public health experts. At this scale, manual verification of every data point by domain experts is nearly impossible. This study evaluates unsupervised machine learning algorithms for detecting such errors, focusing on the 'weight' parameter in the Annual Health Survey (AHS) dataset. The AHS, conducted by the Ministry of Health and Family Welfare, Government of India, in collaboration with the Registrar General of India, is a large-scale, stratified, household-level survey targeting maternal and child health across nine states in India. The dataset is freely available on the Open Government Data (OGD) Platform of India for public health research. Five algorithms were applied to detect erroneous data entries: DBSCAN, K-Means, Gaussian Mixture Model (GMM), Isolation Forest (IF), and One-Class SVM (1C-SVM). The evaluation involved comprehensive preprocessing and feature engineering to optimize detection capability. Each algorithm was assessed on precision, recall, accuracy, false anomaly rate, and missed anomaly rate. Among these, DBSCAN performed best, achieving a recall of 94.7% and a precision of 81.9%, which makes it highly effective for this task. The findings underscore the potential of unsupervised machine learning to automate the detection of data entry errors and thereby improve the integrity of public health data. This research contributes to the advancement of precision public health, supporting more accurate and reliable evidence-based decision-making and policy formulation.
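To make the approach described in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It applies only DBSCAN's noise rule to a one-dimensional weight column (no cluster labels are assigned) and then scores the flags against known errors. The weights, `eps`, `min_samples`, and both function names are hypothetical. The formulas used for the false anomaly rate (FP / (FP + TN)) and missed anomaly rate (FN / (FN + TP)) are assumptions, since the abstract does not define them.

```python
from typing import Dict, List, Sequence


def dbscan_noise_1d(values: Sequence[float], eps: float, min_samples: int) -> List[bool]:
    """Return True for each value that DBSCAN's noise rule would flag.

    A point is a core point if at least `min_samples` points (itself included)
    lie within `eps` of it. A point is noise if it is neither a core point
    nor within `eps` of any core point.
    """
    n = len(values)
    # Brute-force neighbourhoods: O(n^2), acceptable for a small illustration.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbours]
    return [
        not is_core[i] and not any(is_core[j] for j in neighbours[i])
        for i in range(n)
    ]


def detection_metrics(flagged: Sequence[bool], is_error: Sequence[bool]) -> Dict[str, float]:
    """Score anomaly flags against ground-truth error labels."""
    tp = sum(f and e for f, e in zip(flagged, is_error))
    fp = sum(f and not e for f, e in zip(flagged, is_error))
    fn = sum(e and not f for f, e in zip(flagged, is_error))
    tn = sum(not f and not e for f, e in zip(flagged, is_error))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(flagged),
        # Assumed definitions; the abstract names these rates without formulas.
        "false_anomaly_rate": fp / (fp + tn) if fp + tn else 0.0,
        "missed_anomaly_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical weights in kg; the last two mimic keying errors
# (a misplaced decimal point in each direction).
weights_kg = [3.1, 3.2, 3.0, 3.3, 2.9, 3.4, 3.2, 31.0, 0.3]
is_error = [False] * 7 + [True] * 2

flagged = dbscan_noise_1d(weights_kg, eps=0.5, min_samples=3)
metrics = detection_metrics(flagged, is_error)
```

On real survey data, `eps` and `min_samples` would need tuning, and a library implementation such as scikit-learn's `DBSCAN` would likely be a better fit than this brute-force version.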

Journal: Discover Public Health
Publication Date: Oct 10, 2024
License type: CC BY-NC-ND
