Automated curation of noisy external data in the medical domain has long been in high demand, as AI technologies need to be validated using various sources with clean, annotated data. Identifying the variance between internal and external sources is a fundamental step in curating a high-quality dataset, as the data distributions from different sources can vary significantly and subsequently affect the performance of AI models. The primary challenges for detecting data shifts are - (1) accessing private data across healthcare institutions for manual detection and (2) the lack of automated approaches to learn efficient shift-data representation without training samples. To overcome these problems, we propose an automated pipeline called MedShift to detect top-level shift samples and evaluate the significance of shift data without sharing data between internal and external organizations. MedShift employs unsupervised anomaly detectors to learn the internal distribution and identify samples showing significant shiftness for external datasets, and then compares their performance. To quantify the effects of detected shift data, we train a multi-class classifier that learns internal domain knowledge and evaluates the classification performance for each class in external domains after dropping the shift data. We also propose a data quality metric to quantify the dissimilarity between internal and external datasets. We verify the efficacy of MedShift using musculoskeletal radiographs (MURA) and chest X-ray datasets from multiple external sources. Our experiments show that our proposed shift data detection pipeline can be beneficial for medical centers to curate high-quality datasets more efficiently. The code can be found at https://github.com/XiaoyuanGuo/MedShift. An interface introduction video to visualize our results is available at https://youtu.be/V3BF0P1sxQE.
Read full abstract