Data mining to retrieve smoking status from electronic health records in general practice.

Annemarijn R De Boer,Michiel L Bots,Ilonca Vaartjes,Saskia Haitjema,Sander Van Doorn,Mark C H De Groot,T Katrien J Groenhof

doi:10.1093/ehjdh/ztac031


Open Access

https://doi.org/10.1093/ehjdh/ztac031


Abstract

We aimed to optimize an existing data mining algorithm for smoking status, originally developed on hospital electronic health records (EHRs), and to assess its performance in general practice EHRs. We optimized the algorithm in a training set containing all clinical notes from 498 individuals (75 712 contact moments) from the Julius General Practitioners' Network (JGPN). Each contact moment was classified as 'current smoker', 'former smoker', 'never smoker', or 'no information'. Manual review of the EHRs served as the reference. Algorithm performance was assessed in an independent test set (n = 494, 78 129 contact moments) using precision, recall, and F1-score. In the test set, performance for 'current smoker' was precision 79.7%, recall 78.3%, and F1-score 0.79. For 'former smoker', it was precision 73.8%, recall 64.0%, and F1-score 0.69. For 'never smoker', it was precision 92.0%, recall 74.9%, and F1-score 0.83. On a patient level, performance for 'ever smoker' (current and former smoker combined) was precision 87.9%, recall 94.7%, and F1-score 0.91. For 'never smoker', it was precision 98.0%, recall 82.0%, and F1-score 0.89. Clinical notes in general practice were written in a more narrative style than those in hospital EHRs. Data mining can successfully retrieve smoking status information from general practice clinical notes, with good performance for classifying ever and never smokers. Differences between general practice and hospital EHRs call for optimization of data mining algorithms when they are applied beyond their primary development setting.
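The abstract reports precision, recall, and F1-score for each smoking status class. As a minimal sketch of how these metrics relate, the Python below computes them for one class from paired reference and predicted labels. The function names and the toy labels are hypothetical and are not taken from the study; only the metric definitions are standard.

```python
def per_class_metrics(reference, predicted, label):
    """Return (precision, recall, F1) for one class, treating it as the positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)  # correctly assigned
    fp = sum(1 for r, p in pairs if r != label and p == label)  # assigned but wrong
    fn = sum(1 for r, p in pairs if r == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (invented labels, one entry per contact moment).
reference = ["current smoker", "current smoker", "former smoker",
             "never smoker", "no information"]
predicted = ["current smoker", "former smoker", "former smoker",
             "never smoker", "current smoker"]

print(per_class_metrics(reference, predicted, "current smoker"))

# The F1-scores in the abstract follow from the reported precision and recall,
# e.g. 'current smoker' in the test set:
print(round(f1_score(0.797, 0.783), 2))
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, which is why the 'former smoker' class (recall 64.0%) has the lowest F1-score of the three.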
