Cohort design and natural language processing to reduce bias in electronic health records research

Shaan Khurshid, Alice Mcelhinney, Steven A Lubitz, Andrea Derix, Jonathan W Cunningham, Gopal P Sarma, Anthony Philippakis, Pulkit Singh, Ashby C Turner, Christian Diedrich, Emily S Lau, Xin Wang, Mostafa A Al-Alusi, Nathaniel Diamant, Jeffrey M Ashburner, Steven J Atlas, Jennifer E Ho, Christopher Anderson, Mercedeh Ghadessi, Puneet Batra, Marcus D R Klarqvist, Paolo Di Achille, Julian S Haimovich, Hanna M Eilken, Christopher Reeder, Johanna Mielke, Samuel Friedman, Patrick T Ellinor, Lia X Harrington

doi:10.1038/s41746-022-00590-0

Abstract

Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO as opposed to the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
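
The abstract does not describe how the NLP step works, so the following is only a minimal sketch of the general idea: pull a vital sign out of free-text notes when the structured field is empty, then measure how much missingness drops. The regular expression, the plausibility bounds, the record layout, and the helper names `extract_bp` and `missing_fraction` are all illustrative assumptions, not the authors' pipeline.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern (an assumption, not the paper's method): matches text
# such as "BP 128/76", "BP: 128 / 76", or "blood pressure 128/76".
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*:?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_bp(note: str) -> Optional[Tuple[int, int]]:
    """Return (systolic, diastolic) found in a note, or None if absent or implausible."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    sbp, dbp = int(match.group(1)), int(match.group(2))
    # Illustrative plausibility bounds, chosen for this sketch only.
    if not (50 <= sbp <= 300 and 20 <= dbp <= 200 and dbp < sbp):
        return None
    return sbp, dbp


def missing_fraction(values) -> float:
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


# Toy records: a structured blood pressure field (None = missing) plus a note.
records = [
    {"structured_bp": (120, 80), "note": "Routine visit. BP 120/80."},
    {"structured_bp": None, "note": "Pt well. BP: 134/82, HR 70."},
    {"structured_bp": None, "note": "Phone follow-up, no vitals taken."},
    {"structured_bp": None, "note": "Blood pressure 118 / 74 today."},
]

before = [r["structured_bp"] for r in records]
# Prefer the structured value; fall back to the note only when it is missing.
after = [r["structured_bp"] or extract_bp(r["note"]) for r in records]

print(f"missing before: {missing_fraction(before):.2f}")
print(f"missing after:  {missing_fraction(after):.2f}")
```

In this toy data, two of the three missing values are recoverable from the notes. A real pipeline would also need to handle units, negation, historical values quoted in a note, and validation against structured fields, as the reported correlation check suggests.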

Journal: npj Digital Medicine
Publication Date: Apr 8, 2022
Citations: 40
License type: open-access

Similar Papers

Big Data, Predictive Analytics, and Quality Improvement in Kidney Transplantation: A Proof of Concept.
T.R Srinivas ... G Mour
American Journal of Transplantation | VOL. 17
04 Jan 2017

Leveraging electronic health records for clinical research
Sudha R Raman ... Adrian F Hernandez
American Heart Journal | VOL. 202
30 Apr 2018

Ethical, Legal, and Social Issues Related to the Inclusion of Individuals With Intellectual Disabilities in Electronic Health Record Research: Scoping Review.
Melissa Raspa ... Laura Wagner
Journal of Medical Internet Research | VOL. 22
21 May 2020

Natural Language Processing to Improve Prediction of Incident Atrial Fibrillation Using Electronic Health Records.
Jeffrey M Ashburner ... Katherine P Liao
Journal of the American Heart Association | VOL. 11
29 Jul 2022
