Abstract
A large number of atrial fibrillation (AF) detectors have been published in recent years, meaning that the comparison of detector performance plays a central role; however, such comparison is not always performed consistently. The aim of this study is to shed needed light on aspects crucial to the evaluation of detection performance. Three types of AF detector, using either information on rhythm, rhythm and morphology, or segments of ECG samples, are implemented and studied on both real and simulated ECG signals. The properties of different performance measures are investigated, for example, in relation to dataset imbalance. The results show that performance can differ considerably depending on the way detector output is compared to database annotations, i.e., beat-to-beat, segment-to-segment, or episode-to-episode comparison. Moreover, depending on the type of detector, the results substantiate that physiological and technical factors, e.g., changes in ECG morphology, rate of atrial premature beats, and noise level, can have a considerable influence on performance. The present study demonstrates overall strengths and weaknesses of different types of detector, highlights challenges in AF detection, and proposes five recommendations on how to handle data and characterize performance.
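The dependence on comparison granularity can be made concrete with a minimal sketch. The Python snippet below is our own illustration, not the paper's method: the function names and the toy labels are assumptions. It scores the same beat-by-beat detector output against annotations at the beat level and at the episode level, showing how one output can yield very different sensitivity figures.

```python
import numpy as np

def beat_sensitivity(ann, det):
    """Fraction of annotated AF beats that the detector also flags."""
    ann, det = np.asarray(ann, bool), np.asarray(det, bool)
    return det[ann].mean() if ann.any() else float("nan")

def episodes(labels):
    """(start, end) index pairs of contiguous runs of positive labels."""
    padded = np.r_[0, np.asarray(labels, np.int8), 0]
    edges = np.flatnonzero(np.diff(padded))
    return list(zip(edges[::2], edges[1::2]))

def episode_sensitivity(ann, det):
    """Fraction of annotated AF episodes overlapped by >= 1 detection."""
    det = np.asarray(det, bool)
    eps = episodes(ann)
    return sum(det[s:e].any() for s, e in eps) / len(eps) if eps else float("nan")

# Toy labels: one long AF episode (30 beats) and one brief one (2 beats);
# the detector captures the long episode but misses the brief one.
ann = [0]*10 + [1]*30 + [0]*10 + [1]*2 + [0]*8
det = [0]*10 + [1]*30 + [0]*20

print(f"beat-to-beat sensitivity:       {beat_sensitivity(ann, det):.2f}")     # 0.94
print(f"episode-to-episode sensitivity: {episode_sensitivity(ann, det):.2f}")  # 0.50
```

The near-perfect beat-level figure and the 50% episode-level figure describe the same detector; which one is reported changes the apparent ranking of detectors.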
Highlights
The recent interest in deep learning (DL) has led to an avalanche of atrial fibrillation (AF) detectors, e.g., [1]–[17]
The present paper addresses aspects crucial to the evaluation of AF detector performance, leading up to a set of investigation-based recommendations on how to handle data and characterize performance
To shed further light on data imbalance, the performance of the rhythm-based detector is studied on 103 recordings from the Saint Petersburg Atrial Fibrillation Database (SPAFDB), the Atrial Fibrillation Database (AFDB), and the Long-Term AF Database (LTAFDB); 40 recordings with an AF burden of < 1% or > 99% were excluded (a sketch of this exclusion rule follows below)
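As an illustration of that exclusion rule (our own sketch; the 1%/99% thresholds come from the highlight above, everything else is assumed), AF burden can be computed as the fraction of annotated samples, or beats, labeled AF:

```python
import numpy as np

def af_burden(af_mask):
    """AF burden: fraction of samples (or beats) annotated as AF."""
    return float(np.asarray(af_mask, bool).mean())

def keep_recording(af_mask, lo=0.01, hi=0.99):
    """Exclude recordings whose labels are nearly constant (AF burden
    < 1% or > 99%), for which per-recording measures degenerate."""
    return lo <= af_burden(af_mask) <= hi

# Example: a recording that is 99.5% AF would be excluded.
mask = np.r_[np.ones(995, bool), np.zeros(5, bool)]
print(af_burden(mask), keep_recording(mask))  # 0.995 False
```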
Summary
The recent interest in deep learning (DL) has led to an avalanche of atrial fibrillation (AF) detectors, e.g., [1]–[17]. This brings into focus the problem of how to evaluate and compare performance between different detectors, whether based on DL or expert-crafted features. It is essential to outline an evaluation framework that ensures fair comparison and goes beyond reporting overall performance measures. While public databases facilitate the comparison of detector performance, conclusions should be drawn with caution.
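To see why overall measures alone can mislead, consider a minimal sketch with made-up numbers (ours, not an analysis from the paper): on a dataset where only 2% of beats are AF, a detector that never fires still attains high accuracy.

```python
import numpy as np

def confusion_counts(ann, det):
    """Confusion-matrix counts from boolean annotation and detector labels."""
    ann, det = np.asarray(ann, bool), np.asarray(det, bool)
    tp = np.sum(ann & det)
    tn = np.sum(~ann & ~det)
    fp = np.sum(~ann & det)
    fn = np.sum(ann & ~det)
    return tp, tn, fp, fn

# Imbalanced toy data: 20 AF beats out of 1000; the detector never fires.
ann = np.r_[np.ones(20, bool), np.zeros(980, bool)]
det = np.zeros(1000, bool)

tp, tn, fp, fn = confusion_counts(ann, det)
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.98, despite detecting nothing
sensitivity = tp / (tp + fn)                    # 0.0
specificity = tn / (tn + fp)                    # 1.0
print(accuracy, sensitivity, specificity)
```

Reporting sensitivity and specificity per class, rather than a single overall figure, exposes the failure that accuracy hides on imbalanced data.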