Abstract

The detection and removal of poor-quality data in a training set is crucial to achieving high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, and data privacy requirements often prevent AI practitioners from accessing raw training data, so manual visual verification of private patient data is not possible. Here we describe a novel method for automated identification of poor-quality data, called Untrainable Data Cleansing (UDC). The method is shown to have numerous benefits: protection of private patient data; improvement in AI generalizability; reduction in the time, cost, and data needed for training; and a truer reporting of AI performance itself. Additionally, results show that UDC could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.

Highlights

  • The detection and removal of poor-quality data in a training set is crucial to achieve high-performing AI models

  • The accuracy achieved after a second round of Untrainable Data Cleansing (UDC) applied to the symmetric case (30%, 30%) was 99.7%, an improvement even over the baseline accuracy (99.2%) on datasets with 0% synthetic error

  • Further tests would be required to confirm the statistical significance of this uplift, but it is not unreasonable that UDC could filter out noisy data that may be present in the original clean dataset, helping not only to recover but to surpass the accuracy of models trained on the baseline datasets
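
The highlights refer to a "symmetric case (30%, 30%)" of synthetic error without describing how it was constructed. The sketch below shows one plausible reading, assuming a binary classification task in which 30% of the labels in each class are flipped to the other class. The function name `flip_labels`, the toy label list, and the fixed seed are all hypothetical and chosen for illustration; this is not the authors' procedure, and it does not implement UDC itself.

```python
import random


def flip_labels(labels, flip_rates, seed=0):
    """Return a copy of binary labels with a fraction of each class flipped.

    flip_rates maps a class label (0 or 1) to the fraction of that class
    whose labels are flipped to the other class. Illustrative only.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    for cls, rate in flip_rates.items():
        # Select indices from the ORIGINAL labels so that a sample flipped
        # into this class is never flipped back.
        idx = [i for i, y in enumerate(labels) if y == cls]
        n_flip = round(rate * len(idx))
        for i in rng.sample(idx, n_flip):
            noisy[i] = 1 - cls
    return noisy


# Toy dataset: 100 samples per class, 30% of each class mislabeled.
labels = [0] * 100 + [1] * 100
noisy = flip_labels(labels, {0: 0.30, 1: 0.30})
n_changed = sum(a != b for a, b in zip(labels, noisy))
print(n_changed)  # 60 labels differ from the clean set
```

Under this reading, an asymmetric case would simply use different rates for the two classes, for example `{0: 0.10, 1: 0.30}`.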


Summary

Introduction

The detection and removal of poor-quality data in a training set is crucial to achieving high-performing AI models. We describe a novel method for automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits: protection of private patient data; improvement in AI generalizability; reduction in the time, cost, and data needed for training; and a truer reporting of AI performance itself. AI models are trained with labeled or annotated data (medical images) and learn complex features of the images that relate to a clinical outcome; the trained models can then be applied to classify new, unseen medical images. Applications of this technology in healthcare span a wide range of domains, including but not limited to dermatology [7,8], radiology [9,10], ophthalmology [11–13], pathology [14–16], and embryo quality assessment in IVF [17].

Noisy data: the data itself is of poor quality (e.g. an out-of-focus image), making it ambiguous or uninformative, with insufficient information or distinguishing features to correlate with any label.
