A data-pipeline processing electrocardiogram recordings for use in artificial intelligence algorithms

J Prim, S Wegener, J Hannig, T Uhlemann, D Gruen, S Krug, N Gumpfer, M Guckert, T Keller

doi:10.1093/eurheartj/ehab724.3041

Abstract

Introduction: Artificial intelligence (AI) can be used for various tasks in medicine and specifically in cardiology. Medical data such as electrocardiogram recordings (ECGs) are widely used and universally accepted as diagnostic and prognostic tools. It has been shown that deep learning methods using ECGs yield excellent results in detecting cardiac pathologies. Supervised learning algorithms such as deep learning models require a significant amount of reliable data. However, only a small fraction of the ECG data generated in daily practice is available in a fully digital and machine-readable format, such as XML. Frequently, the ECG devices in use produce PDF files or even paper printouts, which must be digitised later for inclusion in clinical information systems. Such ECGs cannot be used for training or applying deep learning models without further effort. Therefore, the aim of the present project was to develop a data-pipeline that generates machine-readable ECG data for AI use irrespective of the initial ECG format.

Methods: We propose an end-to-end pipeline that can process data from modern digital ECG devices and can also extract all necessary information from PDF files, both scanned hard copies and digitally generated PDFs (see Figure 1). By combining different techniques, including the adaptation of open source libraries for vectorisation of image data and modern computer vision technologies such as optical character recognition (OCR), our pipeline can flexibly process data from different recording devices, reading both PDF files and XML delivered by native digital devices. The processed files from the various sources are either saved in a common and easily accessible CSV format or passed directly to deep learning models (see Figure 2).

Results: The developed data-pipeline was validated on a set of 113 12-lead ECGs for which data was available in multiple formats. The dataset in each format was processed separately by our pipeline and then used for training and validation of a deep learning architecture for myocardial scar detection based on raw ECG signals. The quality of the extraction process was assessed through the prediction capability of the respective deep learning models, depicted by receiver operating characteristic (ROC) analyses. Comparing the benchmark model generated from XML data with a model trained purely on PDF data processed by the pipeline shows that both models produced comparable results, reaching area under the curve (AUC) values of 0.79 ± 0.10 (XML) and 0.83 ± 0.07 (PDF).

Conclusion: The data pipeline accelerates ECG-based AI research and the application of AI algorithms by providing access to ECG data irrespective of the format in which the ECG is stored. Future work will focus on independent validation as well as expanding this pipeline to include additional ECG types.

Funding Acknowledgement: Type of funding sources: Public Institution(s). Main funding source(s): Flexi Funds by Forschungscampus Mittelhessen
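As an illustration of the common CSV output mentioned in the Methods, the following minimal sketch converts lead data from an XML recording into CSV with one column per lead. The abstract specifies neither the XML schema nor the CSV layout, so the `<ecg>`/`<lead>` structure, the sample values, and the function names `leads_from_xml` and `leads_to_csv` are all assumptions made for this example, not the authors' implementation. The PDF branch (vectorisation and OCR) is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for illustration only;
# real device exports use vendor-specific schemas.
SAMPLE_XML = """<ecg sampling_rate="500">
  <lead name="I">12 15 19 14</lead>
  <lead name="II">30 33 41 35</lead>
</ecg>"""


def leads_from_xml(xml_text):
    """Parse the hypothetical XML layout into {lead name: [samples]}."""
    root = ET.fromstring(xml_text)
    return {
        lead.get("name"): [int(value) for value in lead.text.split()]
        for lead in root.iter("lead")
    }


def leads_to_csv(leads):
    """Serialise leads as CSV text: one column per lead, one row per sample."""
    names = list(leads)
    lengths = {len(samples) for samples in leads.values()}
    if len(lengths) != 1:
        raise ValueError("all leads must have the same number of samples")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)  # header row with the lead names
    writer.writerows(zip(*(leads[name] for name in names)))
    return buffer.getvalue()


csv_text = leads_to_csv(leads_from_xml(SAMPLE_XML))
print(csv_text)
```

In a sketch like this, any other source format would only need its own parser that returns the same lead dictionary; the CSV step stays unchanged, which is the sense in which the output is independent of the input format.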
