Abstract

Biobanks and national registries are a powerful tool for genomic discovery, but they rely on diagnostic codes that can be unreliable and that fail to capture relationships between related diagnoses. We developed an efficient means of conducting genome-wide association studies using combinations of diagnostic codes from electronic health records for 10,845 participants in a biobanking program at two large academic medical centers. Specifically, we applied latent Dirichlet allocation to fit 50 disease topics based on diagnostic codes, then conducted a genome-wide common-variant association analysis for each topic. In sensitivity analyses, these results were contrasted with those obtained from traditional single-diagnosis phenome-wide association analysis, as well as with those in which only a subset of diagnostic codes was included per topic. In meta-analysis across three biobank cohorts, we identified 23 disease-associated loci with p < 1e-15, including previously associated autoimmune disease loci. In all cases, observed significant associations were of greater magnitude than those for single phenome-wide diagnostic codes, and incorporation of less strongly loading diagnostic codes enhanced association. This strategy provides a more efficient means of identifying phenome-wide associations in biobanks with coded clinical data.
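
The two-step idea described in the abstract (fit disease topics from diagnostic codes, then test each topic against genotype) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes a patients-by-codes count matrix, uses scikit-learn's `LatentDirichletAllocation` for the topic model, and substitutes a plain ordinary least squares regression of topic loading on allele dosage for the association test; the abstract does not state the association model, covariates, or software actually used. The function names `fit_topic_loadings` and `topic_association` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def fit_topic_loadings(code_counts, n_topics=50, seed=0):
    """Fit LDA on a patients x diagnostic-codes count matrix.

    Returns (patient_topic, topic_code): per-patient topic proportions
    and per-topic code weights, each normalized so rows sum to 1.
    """
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    patient_topic = lda.fit_transform(code_counts)
    topic_code = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return patient_topic, topic_code


def topic_association(dosage, topic_loading):
    """Regress one topic loading on one variant's allele dosage (0/1/2).

    Assumption: simple OLS with an intercept and no covariates; a real
    analysis would adjust for ancestry and other confounders.
    Returns (beta, t_statistic) for the dosage term.
    """
    X = np.column_stack([np.ones_like(dosage, dtype=float), dosage])
    coef, _, _, _ = np.linalg.lstsq(X, topic_loading, rcond=None)
    resid = topic_loading - X @ coef
    dof = len(topic_loading) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], coef[1] / np.sqrt(cov[1, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data only: 200 patients, 30 codes, 5 topics (the study fit 50).
    counts = rng.poisson(0.5, size=(200, 30))
    loadings, _ = fit_topic_loadings(counts, n_topics=5)
    dosage = rng.integers(0, 3, size=200)
    for k in range(loadings.shape[1]):
        beta, t = topic_association(dosage, loadings[:, k])
        print(f"topic {k}: beta={beta:.4f}, t={t:.2f}")
```

Using the continuous topic loading as the phenotype, rather than a binary case-control label, is what lets weakly loading codes contribute to the signal; restricting each topic to its top-loading codes would correspond to the subset sensitivity analysis mentioned above.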

Highlights

  • In the search for common genetic variations associated with medical disorders, the traditional analytic approach examines single disorders in case-control cohorts ascertained for a specific disorder

  • Approaches that focus on individual diagnostic codes are limited by inaccurate, missing or heterogeneous diagnoses; for example, individuals with cystic fibrosis might be represented by codes for male infertility, diabetes and chronic rhinosinusitis even in the absence of a diagnostic code for cystic fibrosis [4]

  • Cohort Derivation and Genotyping: We drew on three cohorts of patients seen in the Brigham and Women’s Hospital network and the Massachusetts General Hospital network, representing the first 15,064 individuals genotyped as part of the Partners HealthCare Biobank initiative [10]. These individuals provided informed consent for their electronic health records (EHRs) to be examined in investigations approved by the Partners Institutional Review Board, and provided blood samples for DNA extraction


Summary

Introduction

In the search for common genetic variations associated with medical disorders, the traditional analytic approach examines single disorders in case-control cohorts ascertained for a specific disorder. With the availability of large-scale biobanks with broad ascertainment, multiple approaches to phenome-wide association (that is, looking across a range of clinical phenotypes to detect genetic association) have been proposed [1]. Relying on individual disorders represented in diagnostic codes may not efficiently capture the underlying architecture of genetic risk. Under conditions of pleiotropy, where a single variant contributes to risk for multiple disorders, as in some autoimmune and neuropsychiatric disorders, standard phenome-wide approaches do not make efficient use of the correlation structure between diagnoses. Single-code approaches also do not capture disease subtypes with different genetic architecture, where these subtypes may be reflected in different patterns of comorbidity, as a recent investigation of diabetes mellitus suggests [5,6,7,8].
