An Unsupervised Error Detection Methodology for Detecting Mislabels in Healthcare Analytics.

Pei-Yuan Zhou, Faith Lum, Tony Jiecao Wang, Anubhav Bhatti, Surajsinh Parmar, Chen Dan, Andrew K C Wong

doi:10.3390/bioengineering11080770

Open Access

Journal: Bioengineering (Basel, Switzerland)
Publication Date: Jul 31, 2024
License type: CC BY 4.0

Abstract

Medical datasets may be imbalanced and may contain errors due to subjective test results and clinical variability. Poor-quality source data reduces the accuracy and reliability of classification, so detecting abnormal samples in a dataset can help clinicians make better decisions. In this study, we propose an unsupervised error detection method that uses patterns discovered by the Pattern Discovery and Disentanglement (PDD) model developed in our earlier work. Applied to a large dataset, the eICU Collaborative Research Database, for sepsis risk assessment, the proposed algorithm can effectively discover statistically significant association patterns, generate an interpretable knowledge base, cluster samples in an unsupervised manner, and detect abnormal samples in the dataset. In the experimental results, our method outperformed K-Means by 38% on the full dataset and by 47% on the reduced dataset for unsupervised clustering. After abnormal samples were removed by the proposed error detection approach, the accuracy of multiple supervised classifiers improved by an average of 4%. The proposed algorithm therefore provides a robust and practical solution for unsupervised clustering and error detection in healthcare data.
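
The general idea of flagging suspected mislabels from an unsupervised clustering can be illustrated with a minimal sketch. This is not the PDD-based method of the paper: it assumes cluster assignments are already available from some unsupervised method, and it swaps in a simple majority-vote rule in which a sample is flagged when its recorded label disagrees with the most common label in its cluster. The function name `flag_suspected_mislabels` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def flag_suspected_mislabels(cluster_ids, labels):
    """Return indices of samples whose label differs from their cluster's majority label.

    Assumption: a simple majority-vote rule, used here only for illustration.
    It is not the pattern-based criterion described in the abstract.
    """
    # Group the recorded labels by cluster.
    labels_by_cluster = defaultdict(list)
    for cid, lab in zip(cluster_ids, labels):
        labels_by_cluster[cid].append(lab)

    # Majority label per cluster (ties go to the label seen first).
    majority = {
        cid: Counter(labs).most_common(1)[0][0]
        for cid, labs in labels_by_cluster.items()
    }

    # Flag every sample that disagrees with its cluster's majority label.
    return [
        i
        for i, (cid, lab) in enumerate(zip(cluster_ids, labels))
        if lab != majority[cid]
    ]


# Toy data (hypothetical): two clusters, each containing one label that disagrees.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["sepsis", "sepsis", "sepsis", "no_sepsis",
          "no_sepsis", "no_sepsis", "sepsis", "no_sepsis"]

suspects = flag_suspected_mislabels(cluster_ids, labels)
suspect_set = set(suspects)

# Keep only the samples that were not flagged, e.g. before training a classifier.
cleaned = [
    (cid, lab)
    for i, (cid, lab) in enumerate(zip(cluster_ids, labels))
    if i not in suspect_set
]
print(suspects)
```

In a workflow like the one the abstract describes, a supervised classifier would then be trained once on the full data and once on the cleaned data, and the two accuracies compared.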

Full Text