Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms

D.J. Albers, N. Elhadad, J. Claassen, R. Perotte, A. Goldstein, G. Hripcsak

doi:10.1016/j.jbi.2018.01.004

Abstract

We study the question of how to represent or summarize raw laboratory data taken from an electronic health record (EHR) using parametric model selection to reduce or cope with biases induced through clinical care. It has been previously demonstrated that the health care process (Hripcsak and Albers, 2012, 2013), as defined by measurement context (Hripcsak and Albers, 2013; Albers et al., 2012) and measurement patterns (Albers and Hripcsak, 2010, 2012), can influence how EHR data are distributed statistically (Kohane and Weber, 2013; Pivovarov et al., 2014). We construct an algorithm, PopKLD, which is based on information criterion model selection (Burnham and Anderson, 2002; Claeskens and Hjort, 2008) and is intended to reduce and cope with health care process biases and to produce an intuitively understandable continuous summary. The PopKLD algorithm can be automated and is designed to be applicable in high-throughput settings; for example, the output of the PopKLD algorithm can be used as input for phenotyping algorithms. Moreover, we develop the PopKLD-CAT algorithm, which transforms the continuous PopKLD summary into a categorical summary useful for applications that require categorical data, such as topic modeling. We evaluate our methodology in two ways. First, we apply the method to laboratory data collected in two different health care contexts, primary versus intensive care. We show that the PopKLD preserves known physiologic features in the data that are lost when summarizing the data using more common laboratory data summaries such as mean and standard deviation. Second, for three disease-laboratory measurement pairs, we perform a phenotyping task: we use the PopKLD and PopKLD-CAT algorithms to define high and low values of the laboratory variable that are used for defining a disease state. We then compare the relationship between the PopKLD-CAT summary disease predictions and the same predictions using empirically estimated mean and standard deviation to a gold standard generated by clinical review of patient records. We find that the PopKLD laboratory data summary is substantially better at predicting disease state. The PopKLD and PopKLD-CAT algorithms are not meant to be used as phenotyping algorithms, but we use the phenotyping task to show what information can be gained when using a more informative laboratory data summary. In the process of evaluating our method, we show that the different clinical contexts and laboratory measurements necessitate different statistical summaries. Similarly, leveraging the principle of maximum entropy, we argue that while some laboratory data only have sufficient information to estimate a mean and standard deviation, other laboratory data captured in an EHR contain substantially more information than can be captured in higher-parameter models.
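The abstract describes PopKLD only as being based on information criterion model selection over parametric models. As a rough illustration of that general idea, and not the PopKLD algorithm itself, the sketch below fits three candidate distributions to one laboratory variable by maximum likelihood and keeps the one with the lowest Akaike information criterion (AIC). The candidate set, the use of AIC, the function names, and the synthetic data are all assumptions made for this example; the paper's own candidate models and selection criterion may differ.

```python
import math
import random

# Illustrative sketch only: generic AIC-based selection among parametric
# models. This is NOT the published PopKLD algorithm; the candidate set and
# all names below are assumptions made for this example.

def fit_normal(xs):
    """Maximum-likelihood normal fit; returns (params, log-likelihood, k)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE variance (divides by n)
    loglik = sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for x in xs
    )
    return {"mu": mu, "sigma": math.sqrt(var)}, loglik, 2

def fit_lognormal(xs):
    """Maximum-likelihood log-normal fit; requires strictly positive values."""
    logs = [math.log(x) for x in xs]
    params, loglik_of_logs, k = fit_normal(logs)
    # Change of variables: density of x is the normal density of log(x) / x.
    return params, loglik_of_logs - sum(logs), k

def fit_exponential(xs):
    """Maximum-likelihood exponential fit; requires positive values."""
    n = len(xs)
    rate = n / sum(xs)
    loglik = n * math.log(rate) - rate * sum(xs)
    return {"rate": rate}, loglik, 1

CANDIDATES = {
    "normal": fit_normal,
    "lognormal": fit_lognormal,
    "exponential": fit_exponential,
}

def select_model(xs):
    """Fit every candidate and return (best_name, results), lowest AIC wins."""
    results = {}
    for name, fit in CANDIDATES.items():
        params, loglik, k = fit(xs)
        results[name] = {"params": params, "aic": 2 * k - 2 * loglik}
    best = min(results, key=lambda name: results[name]["aic"])
    return best, results

# Usage with synthetic, right-skewed values (not real laboratory data).
random.seed(0)
synthetic_values = [random.lognormvariate(4.6, 0.5) for _ in range(2000)]
best_name, fits = select_model(synthetic_values)
print(best_name, fits[best_name]["params"])
```

The selected model's parameters then serve as the summary of the variable, which is how a skewed laboratory value can end up with a different summary than a symmetric one instead of both being reduced to a mean and standard deviation.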

Highlights

Electronic health record (EHR) data offer us the opportunity to carry out clinical research on a broad population relatively quickly while minimizing both the financial and human costs because the data are collected for health care.
The results from the PopKLD algorithm for 64 common laboratory values are found in Table 1; the laboratory values included are split into clinically relevant groupings, including metabolic, blood gases, whole blood, differential, hepatobiliary, lipids, anemia, cardiac, hormone, inflammatory, vitamin, and urinary laboratory values.
The intensive care unit (ICU)-restricted glucose is included in an attempt to isolate the data generated primarily due to physiology and with relatively minimal health care process bias due to collection context.

Summary

Introduction

Electronic health record (EHR) data offer us the opportunity to carry out clinical research on a broad population relatively quickly while minimizing both the financial and human costs because the data are collected for health care. A number of simple summarization techniques have been employed, such as using the presence of a measurement, the last value, the median, the mean, the standard deviation, or similar variations. These summaries assume that the important information in the measurements can be conveyed in one or two parameters (e.g., mean and standard deviation). For high-throughput phenotyping, the selection of a summary technique would have to be automated given the number of potential variables and phenotypes.
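For concreteness, the simple summaries listed above can be computed in a few lines. This is a minimal sketch using only the Python standard library; the function name, the dictionary keys, and the example values are hypothetical and are not taken from the paper.

```python
import statistics

def simple_summaries(values):
    """Summarize one patient's results for one laboratory test.

    `values` is assumed to be a chronologically ordered list of numeric
    results. The function name and returned keys are illustrative choices.
    """
    if not values:
        # No measurement at all: only "presence" can be reported.
        return {"present": False}
    return {
        "present": True,
        "last": values[-1],
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        # Sample standard deviation needs at least two values.
        "sd": statistics.stdev(values) if len(values) > 1 else None,
    }

# Usage with made-up values.
print(simple_summaries([90, 110, 100]))
```

Each of these reduces the measurements to one or two numbers, which is the assumption the paragraph above questions.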

Journal: Journal of Biomedical Informatics
Publication Date: Jan 31, 2018
Citations: 22
License type: CC BY
