Abstract

Background/Introduction
Many research databases contain anonymised electrocardiograms (ECGs) linked to other sensitive information. ECGs hold features unique to individuals, potentially enabling subject identification from anonymised ECGs.

Purpose
We assessed whether artificial intelligence approaches that output ECG pair similarity can re-identify individuals from anonymised ECGs. Additionally, we aimed to explore clinical risk prediction using ECG similarity over time.

Methods
We trained a convolutional Siamese neural network (SNN) with a triplet loss function to determine whether two ECGs belong to the same individual (Figure 1). The model aims to encode ECG inputs from the same subject closer together than ECG inputs from different subjects. It outputs an ECG similarity score corresponding to the probability that two ECGs belong to the same subject. This continuous metric can be used in a binary manner: below a threshold, two ECGs are classed as being from the same individual. Our dataset comprised 72,455 secondary care subjects from a USA cohort with 4 to 75 ECGs each, totalling 864,283 ECGs. These were split 50:10:40 at the subject level for training, validation and testing.

Results
In 2,689,124 same-subject pairs and 2,689,124 different-subject pairs from the test set, the model achieves an accuracy of 91.68% at the binary threshold. This improves to 93.61% and 95.97% in outpatient and normal ECG subsets, respectively. We tested the model by using a subject’s ECG to identify them from a group in which only one ECG also belongs to that subject. In groups of 100 normal ECGs, the success rate is 89.4% (if random, success would be 1%). This rises to 91.4%, 95.1%, and 96.3% when supplemented with subject gender, age (decade bracket), and both, respectively. We calculated a model certainty score for this task. When the model is >95% certain, the success rate is 99%; this occurs in 65% of trials with a group size of 100.
In groups of 1000 normal ECGs, the success rate is 72.5% with no additional information, and 75.9%, 84.2% and 87.5% when supplemented as before (if random, success would be 0.1%). >95% certainty occurs in 33% of trials with a group size of 1000. For a given subject, ECG pair similarity proved clinically informative. The model accounts for slight variations within a subject’s ECGs, even when they are far apart in time (Figure 2A). However, when ECG morphology changes substantially, it incorrectly predicts that two ECGs from the same subject come from different individuals (Figure 2B), a behaviour that could be used to identify clinical deterioration. The temporal evolution of ECG pair similarity reflects clinical trajectory, with greater dissimilarity associated with all-cause mortality (hazard ratio 1.22 per 1 standard deviation change, p < 0.0001).

Conclusion(s)
Anonymised ECGs retain information useful for subject re-identification. Whilst this implies possible ethical and data protection issues with large datasets, such approaches offer potential for continuous monitoring and identifying clinical deterioration.
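
The Methods and Results above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes ECGs have already been encoded as fixed-length embedding vectors, that similarity is measured as Euclidean distance (consistent with "below a threshold" meaning same subject), and that the triplet loss takes the common hinge form. The function names, the margin value and the toy vectors are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-form triplet loss (assumed form; margin is an assumed value).

    The loss is zero once the same-subject pair (anchor, positive) is closer
    than the different-subject pair (anchor, negative) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def same_subject(emb_a, emb_b, threshold):
    """Binary use of the continuous metric: below the threshold means same subject."""
    return euclidean(emb_a, emb_b) < threshold


def identify(query, group):
    """Return the index of the group embedding closest to the query embedding."""
    return min(range(len(group)), key=lambda i: euclidean(query, group[i]))


def chance_rate(group_size):
    """Success rate of picking at random when exactly one group member matches."""
    return 1.0 / group_size


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    query = [0.0, 0.0]
    group = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
    print("closest index:", identify(query, group))
    print("chance at group size 100:", chance_rate(100))
    print("chance at group size 1000:", chance_rate(1000))
```

The `chance_rate` helper reproduces the random baselines quoted in the Results (1% for groups of 100, 0.1% for groups of 1000); the reported success rates themselves depend on the trained model and cannot be reproduced from this sketch.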