Bias Associated with Mining Electronic Health Records

George Hripcsak, Charles Knirsch, Li Zhou, Adam Wilcox, Genevieve Melton

doi:10.5210/disco.v6i0.3581

Abstract

Large-scale electronic health record research introduces biases compared to traditional manually curated retrospective research. We used data from a community-acquired pneumonia study for which we had a gold standard to illustrate such biases. The challenges include data inaccuracy, incompleteness, and complexity, and they can produce distorted results. We found that a naïve approach approximated the gold standard, but errors on a minority of cases shifted mortality substantially. Manual review revealed errors in both selecting and characterizing the cohort, and narrowing the cohort improved the result. Nevertheless, a significantly narrowed cohort might contain its own biases that would be difficult to estimate.

Highlights

With the increasing adoption of electronic health records, there is the potential for over one billion patient visits to be documented per year in the US [1], and these data should be a boon to clinical research [2]
Between the years 1996 and 1999, the clinical data warehouse had 49,642 inpatient and ambulatory cases that had some indication of pneumonia, and 18,715 cases had corroboratory evidence and were considered community acquired
A manual review revealed two main problems: (1) many subjects did not have pneumonia and (2) many subjects in Class I should have been in Class III or higher, including several intensive care unit cases

Summary

Introduction

With the increasing adoption of electronic health records, there is the potential for over one billion patient visits to be documented per year in the US [1], and these data should be a boon to clinical research [2]. Large‐scale electronic health record‐based research is more challenging than traditional retrospective studies. The record is frequently inaccurate [3], incomplete, and complex. In a traditional retrospective study, a human expert reads the data sources (which may include electronic health records) for each subject, interprets them, and records more reliable variables. Data that are obviously inaccurate or contradictory are adjudicated, missing variables are frequently filled in by inferring related information from other variables, and deeply nested information is interpreted in the context of the study. Large‐scale electronic health record‐based research hopes to process huge numbers of subjects without subject‐by‐subject human intervention. Study analysts must therefore attempt to mimic the reasoning that researchers apply to individual records.
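As a rough illustration of what mimicking that reasoning can look like, the sketch below encodes two of the reviewer's steps as explicit rules: treating an obviously implausible value as missing, and filling in a missing variable from a related one. It is a minimal sketch and not the method used in the study; the field names (`resp_rate`, `icu_stay`, `locations`), the plausibility range, and the function names are all hypothetical and chosen only for the example.

```python
from typing import Optional

# Hypothetical plausibility range in breaths per minute; not taken from the study.
PLAUSIBLE_RESP_RATE = (4, 80)


def clean_resp_rate(value: Optional[float]) -> Optional[float]:
    """Treat an obviously implausible value as missing rather than trusting it."""
    if value is None:
        return None
    low, high = PLAUSIBLE_RESP_RATE
    return value if low <= value <= high else None


def infer_icu_stay(record: dict) -> Optional[bool]:
    """Fill in a missing ICU flag from a related variable, if one is available."""
    if record.get("icu_stay") is not None:
        return record["icu_stay"]
    locations = record.get("locations")  # hypothetical list of unit names
    if locations is None:
        return None  # nothing to infer from, so the gap stays visible
    return any("ICU" in loc.upper() for loc in locations)


def adjudicate(record: dict) -> dict:
    """Apply both rules and return a cleaned copy of the record."""
    cleaned = dict(record)
    cleaned["resp_rate"] = clean_resp_rate(record.get("resp_rate"))
    cleaned["icu_stay"] = infer_icu_stay(record)
    return cleaned


# Hypothetical record: an implausible respiratory rate and a missing ICU flag.
example = {"resp_rate": 220, "icu_stay": None, "locations": ["ED", "Medical ICU"]}
print(adjudicate(example))
```

Rules of this kind only cover the cases their author anticipated, whereas a human reviewer can adjudicate contradictions nobody planned for. Returning `None` when nothing can be inferred keeps the remaining gaps explicit instead of silently guessing.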

Methods

Results

Conclusion