Abstract

Background/Introduction
ECG-based artificial intelligence (AI) is an emerging field in digital cardiology. Training on diseased records vs. healthy controls is common practice. We aimed to evaluate whether such an approach can lead to unwanted model behaviour in real-world settings and thus unnecessarily reduce the diagnostic precision of the developed AI model.

Purpose
Several studies have shown that deep neural networks can exceed the performance of medical experts. However, when these models are applied to different cohorts, results vary considerably. We hypothesise that this is because the datasets used for training were not representative of the target population.

Methods
Based on the public ECG database PTB-XL, we sampled three distinct subsets of n=150 records each, representing ECG groups labelled with the diagnoses 'old myocardial infarction' (M), 'normal ECG' (N) and 'other cardiac abnormality' (O). These groups were combined into three datasets ([M, N] (n=300), [M, O] (n=300) and [M, N, O] (n=450)), representing different approaches to data sampling (a schematic sketch of this design follows the abstract). On each dataset, we trained a separate but identically structured deep neural network using 100-fold bootstrapping. The diagnostic performance of each model was validated on unseen data from all three datasets using sensitivity, specificity and the area under the receiver operating characteristic curve (AUROC).

Results
Of the three differently trained models, diagnostic performance was best on the M vs. N records and worst on the M vs. O records. However, in the out-of-dataset setting, the best-performing model (trained on [M, N]) showed weaker performance on the [M, N, O] and [M, O] datasets. Sensitivity of a given model remained constant across evaluation datasets, as identical M records were used throughout the corresponding bootstrapping folds. Detailed results are presented in Table 1.

Conclusions
Our results suggest that the model trained on a dataset containing only diseased records vs. healthy controls ([M, N]) learned to recognise healthy (N) rather than diseased (M) records, which explains why it performed poorly on datasets including records with other cardiac abnormalities (O). Such behaviour is a common problem in AI and requires special attention during dataset sampling. For small cohorts, it is tempting to increase the amount of training data by adding healthy controls. However, we have shown that this can be a poor choice, since classifiers can more easily rely on features that are not actually related to the target disease. Training and validation of classifiers should therefore be performed on representative datasets that are as close as possible to the target population.

Funding Acknowledgement
Type of funding sources: Public Institution(s). Main funding source(s): Forschungscampus Mittelhessen, Flexi Funds.
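
The following is a minimal sketch of the sampling and cross-dataset evaluation design described in the Methods, not the authors' code: synthetic Gaussian features stand in for ECG feature extraction, a logistic-regression classifier replaces the deep neural network, and a single stratified hold-out split replaces the 100-fold bootstrapping. Only the group sizes (n=150) and the three dataset combinations follow the abstract; all numeric settings in the sketch are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    MEANS = {"M": 1.0, "N": 0.0, "O": 0.6}  # hypothetical class separations

    def sample_group(label, n=150, d=32):
        # Stand-in for sampling n PTB-XL records with diagnosis label
        # M ('old myocardial infarction'), N ('normal ECG') or
        # O ('other cardiac abnormality') and extracting d ECG features.
        X = rng.normal(loc=MEANS[label], size=(n, d))
        y = np.full(n, 1 if label == "M" else 0)  # binary target: M vs. rest
        return X, y

    groups = {label: sample_group(label) for label in "MNO"}

    def make_dataset(labels):
        # Combine the sampled groups into one dataset and hold out a
        # test portion (the study used 100-fold bootstrapping instead).
        X = np.vstack([groups[l][0] for l in labels])
        y = np.concatenate([groups[l][1] for l in labels])
        return train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    datasets = {name: make_dataset(name) for name in ("MN", "MO", "MNO")}

    # Train one placeholder classifier per dataset, then validate every
    # model on the held-out data of all three datasets.
    for tr, (Xtr, _, ytr, _) in datasets.items():
        model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        for te, (_, Xte, _, yte) in datasets.items():
            auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
            print(f"trained on [{','.join(tr)}] -> tested on [{','.join(te)}]: AUROC {auc:.2f}")

The printed matrix of cross-dataset AUROC values illustrates the train-on-one, validate-on-all design; with synthetic features it reproduces the reported pattern only qualitatively.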
