The impact of inconsistent human annotations on AI driven clinical decision making

Aneeta Sylolypavan, Derek Sleeman, Honghan Wu, Malcolm Sim

doi:10.1038/s41746-023-00773-3

Abstract

In supervised learning model development, domain experts are often used to provide the class labels (annotations). Annotation inconsistencies commonly occur when even highly experienced clinical experts annotate the same phenomenon (e.g., medical image, diagnostics, or prognostic status), due to inherent expert bias, judgments, and slips, among other factors. While their existence is relatively well-known, the implications of such inconsistencies are largely understudied in real-world settings, where supervised learning is applied to such ‘noisy’ labelled data. To shed light on these issues, we conducted extensive experiments and analyses on three real-world Intensive Care Unit (ICU) datasets. Specifically, individual models were built from a common dataset, annotated independently by 11 Glasgow Queen Elizabeth University Hospital ICU consultants, and model performance estimates were compared through internal validation (Fleiss’ κ = 0.383, i.e., fair agreement). Further, broad external validation (on both static and time series datasets) of these 11 classifiers was carried out on a HiRID external dataset, where the models’ classifications were found to have low pairwise agreements (average Cohen’s κ = 0.255, i.e., minimal agreement). Moreover, the classifiers tend to disagree more on discharge decisions (Fleiss’ κ = 0.174) than on predicting mortality (Fleiss’ κ = 0.267). Given these inconsistencies, further analyses were conducted to evaluate the current best practices in obtaining gold-standard models and determining consensus. The results suggest that: (a) there may not always be a “super expert” in acute clinical settings (using internal and external validation model performances as a proxy); and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for determining consensus achieves optimal models in most cases.
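
The agreement statistics quoted in the abstract (pairwise Cohen’s κ, Fleiss’ κ across several raters) and the majority-vote consensus it evaluates can be illustrated with a minimal sketch. This is not the authors’ code: the function names and the toy labels are hypothetical, and the formulas are the standard textbook definitions of the two statistics.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    each list containing the same number of raters (at least two)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    if p_e == 1.0:  # every rating is the same label
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)

def majority_vote(item):
    """Most frequent label for one item; ties go to the label seen first."""
    return Counter(item).most_common(1)[0][0]

# Hypothetical toy data: 3 raters, 4 items, binary labels (assumption).
toy = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [1, 1, 1]]
consensus = [majority_vote(item) for item in toy]
kappa_all = fleiss_kappa(toy)
kappa_pair = cohen_kappa([r[0] for r in toy], [r[1] for r in toy])
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which is the scale on which the abstract’s κ figures should be read.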

Journal: npj Digital Medicine	Publication Date: Feb 21, 2023
Citations: 22	License type: open-access

Similar Papers

Why a Trade-Off? The Relationship between the External and Internal Validity of Experiments
Maria Jimenez-Buedo ... Luis Miguel Miller
THEORIA | VOL. 25
27 Sep 2010

Prediction of 30-day mortality in heart failure patients with hypoxic hepatitis: Development and external validation of an interpretable machine learning model.
Run Sun ... Yansong Dong
Frontiers in cardiovascular medicine | VOL. 9
28 Oct 2022

Assessing methodological quality and biological plausibility in occupational health psychology.
Michiel Kompier ... Toon W Taris
Scandinavian journal of work, environment & health | VOL. 30
01 Apr 2004

Selecting and Improving Quasi-Experimental Designs in Effectiveness and Implementation Research.
Margaret A Handley ... Adithya Cattamanchi
Annual Review of Public Health | VOL. 39
12 Jan 2018
