Abstract

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term "active label cleaning". We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically-devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
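
As a rough illustration of the ranking idea, consider the following minimal sketch. It is not the authors' exact implementation: it assumes a trained classifier's posterior probabilities are available, and it scores each sample by how strongly the model disagrees with the observed label (cross-entropy) minus how ambiguous the sample is (posterior entropy), so that likely-mislabelled but unambiguous samples surface first. The function names and the exact scoring rule are illustrative assumptions.

    import numpy as np

    def ranking_scores(posteriors, noisy_labels, eps=1e-12):
        """Score samples for relabelling priority.

        posteriors:   (N, C) array of model class probabilities per sample.
        noisy_labels: (N,) array of observed (possibly incorrect) class indices.
        Higher score = the model disagrees strongly with the observed label
        while remaining confident, i.e. likely mislabelled and not ambiguous.
        """
        p = np.clip(np.asarray(posteriors), eps, 1.0)
        idx = np.arange(len(noisy_labels))
        # Cross-entropy of the observed label against the model posterior:
        # large when the model assigns low probability to the given label.
        disagreement = -np.log(p[idx, noisy_labels])
        # Posterior entropy: large for ambiguous samples that are hard to relabel.
        ambiguity = -(p * np.log(p)).sum(axis=1)
        return disagreement - ambiguity

    def relabelling_priority(posteriors, noisy_labels):
        # Indices sorted so the most promising samples to re-annotate come first.
        return np.argsort(-ranking_scores(posteriors, noisy_labels))

Ranking by such a score concentrates annotation effort on samples whose labels are both probably wrong and comparatively easy to fix.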

Highlights

  • Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance

  • Due to practical constraints on the total number of re-annotations, samples often need to be prioritised to maximise the benefit of relabelling efforts, since the difficulty of reviewing labelling errors can vary across samples (see the simulation sketch after this list)

  • While there are learning approaches designed to handle label noise during training, we claim that these strategies can benefit from active label cleaning for two main reasons: first, clean evaluation labels are often unavailable in practice, in which case one cannot reliably determine whether any trained model is effective for a given real-world application; second, models trained with such strategies can still learn biases from the noisy data
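
The prioritisation highlighted above can be made concrete with a toy relabelling simulation in the spirit of the paper's evaluation framework. The annotator model, the budget accounting, and the majority-vote rule below are simplifying assumptions rather than the paper's exact protocol.

    import random
    from collections import Counter

    def simulate_relabelling(priority, noisy_labels, true_labels, budget,
                             annotator_error=0.1, num_classes=10, seed=0):
        """Toy simulation of spending a fixed annotation budget.

        Samples are visited in the given priority order. Each simulated
        annotator returns the true label with probability 1 - annotator_error
        and a random class otherwise; a sample's label is replaced once a
        strict majority emerges among its collected annotations.
        """
        rng = random.Random(seed)
        labels = list(noisy_labels)
        for i in priority:
            if budget <= 0:
                break
            votes = Counter([labels[i]])  # the current label counts as one vote
            while budget > 0:
                budget -= 1
                if rng.random() < annotator_error:
                    vote = rng.randrange(num_classes)  # noisy annotator guess
                else:
                    vote = true_labels[i]
                votes[vote] += 1
                top, count = votes.most_common(1)[0]
                if count > sum(votes.values()) / 2:  # strict majority reached
                    labels[i] = top
                    break
        return labels

Comparing the labels cleaned under the ranked order against those cleaned under a random order, for the same budget, gives the kind of efficiency comparison the abstract refers to.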


Summary

Introduction

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. There is a need for relabelling strategies that consider both resource constraints and individual sample difficulty, especially in healthcare, where the availability of experts is limited and the variability of annotations is typically high due to the difficulty of the tasks [11]. Although some learning approaches are designed to handle label noise during training, models trained with them can still learn biases from the noisy data, which may lead them to fail to identify incorrect labels, to flag already correct ones, or even to introduce additional label noise via self-confirmation. Active label cleaning complements this perspective, aiming to correct potential biases by improving the quality of the training dataset while preserving as many samples as possible. This is imperative in safety-critical domains such as healthcare, where model robustness must be validated on clean labels.

