Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

Danqing Xu, Chen Wang, Shawn Murphy, Atlas Khan, Yizhao Ni, Adam Gordon, Iftikhar J Kullo, Zihuai He, Iuliana Ionita-Laza, Ning Shang, Chunhua Weng, Wei-Qi Wei, Krzysztof Kiryluk, Ali Gharavi

doi:10.1038/s41746-021-00488-3

Abstract

Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge from human experts and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, focusing on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that, relative to existing approaches, the proposed methods have higher prediction accuracy, better identify phenotypic features relevant to the disease under consideration, perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel, which includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate that quantitative disease risk scores derived from rich phenotypic EHR databases provide a more meaningful characterization of clinical risk for diseases of interest than the prevalent binary (case-control) classification.

Highlights

The increasing availability of rich phenotype data from electronic health records (EHR), such as the multicenter Electronic Medical Records and Genomics (eMERGE) network[1,2], BioVU[3] from Vanderbilt University, the Geisinger Health System’s DiscovEHR in Pennsylvania[4], the Harvard University/Partners Healthcare system i2b2 effort[5], and the United Kingdom Biobank (UKBB)[6], together with their linkage to biobanks of human germline DNA samples, provides great opportunities for genomic-based research[7,8,9].
We have proposed an almost unsupervised method to derive a quantitative disease risk score, a linear combination of multiple principal components (LPC), based on phenotypic features available in EHR from health systems.
The proposed quantitative disease risk score has several advantages: (1) it can be derived on a large number of individuals using only minimal clinical input, (2) it can be derived with only weak labels, as opposed to the limited gold/silver standard label information that may be available in the EHR, (3) it can help stratify individuals according to disease risk severity and identify undiagnosed cases, (4) it can identify disease-relevant features, and (5) it can take advantage of biobanks linked to clinical information from EHR to perform potentially more powerful genetic association studies.

Summary

Introduction

The increasing availability of rich phenotype data from electronic health records (EHR), such as the multicenter Electronic Medical Records and Genomics (eMERGE) network[1,2], BioVU[3] from Vanderbilt University, the Geisinger Health System’s DiscovEHR in Pennsylvania[4], the Harvard University/Partners Healthcare system i2b2 effort[5], and the United Kingdom Biobank (UKBB)[6], together with their linkage to biobanks of human germline DNA samples, provides great opportunities for genomic-based research[7,8,9]. Inferring phenotypes from International Classification of Diseases (ICD) codes is not trivial, and many algorithms have already been proposed[10]. Although these algorithms can generate high-quality case/control labels for specific diseases, a main limitation is that they require extensive knowledge and involvement of human experts, are time-consuming, are not systematically applied, and can lead to inconsistent case definitions across algorithms[11]. They also tend to perpetuate the view of common diseases as discrete entities rather than as residing on a continuum. Thinking quantitatively about common diseases could prove beneficial for genomic studies of phenotypes derived from EHR[12,13].

Methods

Results

Conclusion