Integrating EMR-linked and in vivo functional genetic data to identify new genotype-phenotype associations.

Jonathan D Mosley, Quinn S Wells, Jessica T Delaney, Peter E Weeke, Lisa Bastarache, Sara L Van Driest, Josh C Denny, Dan M Roden, Joseph Devaney

doi:10.1371/journal.pone.0100322

Abstract

The coupling of electronic medical records (EMR) with genetic data has created the potential for implementing reverse genetic approaches in humans, whereby the function of a gene is inferred from the shared pattern of morbidity among homozygotes for a genetic variant. We explored the feasibility of this approach to identify phenotypes associated with low frequency variants using Vanderbilt's EMR-based BioVU resource. We analyzed 1,658 low frequency non-synonymous SNPs (nsSNPs) with a minor allele frequency (MAF) <10% collected on 8,546 subjects. For each nsSNP, we identified diagnoses shared by at least 2 minor allele homozygotes and with an association p<0.05. The diagnoses were reviewed by a clinician to ascertain whether they may share a common mechanistic basis. While a number of biologically compelling clinical patterns of association were observed, the frequency of these associations was identical to that observed using genotype-permuted data sets, indicating that the associations were likely due to chance. To refine our analysis, we then restricted it to 711 nsSNPs in genes with phenotypes in the On-line Mendelian Inheritance in Man (OMIM) or knock-out mouse phenotype databases. An initial comparison of the EMR diagnoses to the known in vivo functions of the gene identified 25 candidate nsSNPs, 19 of which had significant genotype-phenotype associations when tested using matched controls. Twelve of the 19 nsSNP associations were confirmed by a detailed record review. Four of the 12 nsSNP-phenotype associations were successfully replicated in an independent data set: thrombosis (F5, rs6031), seizures/convulsions (GPR98, rs13157270), macular degeneration (CNGB3, rs3735972), and GI bleeding (HGFAC, rs16844401). These analyses demonstrate the feasibility and challenges of using reverse genetics approaches to identify novel gene-phenotype associations in human subjects using low frequency variants.
As increasing amounts of rare variant data are generated from modern genotyping and sequence platforms, model organism data may be an important tool to enable discovery.
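
The screening step described in the abstract (diagnoses shared by at least 2 minor allele homozygotes with an association p<0.05, compared against genotype-permuted data) can be illustrated with a minimal sketch. The abstract does not name the statistical test, so the one-sided Fisher exact test used here is an assumption, as are the function names, the data layout, and the toy data; this is not the authors' pipeline.

```python
import random
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a = homozygotes with the diagnosis, b = homozygotes without it,
    c = other subjects with the diagnosis, d = other subjects without it.
    Returns the probability of seeing at least `a` diagnosed homozygotes.
    (Assumed test: the abstract only states a p<0.05 threshold.)
    """
    n_hom, n_dx, n = a + b, a + c, a + b + c + d
    upper = min(n_hom, n_dx)
    tail = sum(comb(n_dx, k) * comb(n - n_dx, n_hom - k)
               for k in range(a, upper + 1))
    return tail / comb(n, n_hom)


def screen_diagnoses(homozygotes, diagnoses, min_shared=2, alpha=0.05):
    """Return (code, n_homozygotes_with_code, p) for diagnoses that pass.

    homozygotes: set of subject IDs homozygous for the minor allele.
    diagnoses: dict mapping every genotyped subject ID to a set of codes.
    """
    all_codes = set().union(*diagnoses.values())
    hits = []
    for code in sorted(all_codes):
        a = sum(1 for s in homozygotes if code in diagnoses[s])
        if a < min_shared:  # must be shared by at least 2 homozygotes
            continue
        n_dx = sum(1 for codes in diagnoses.values() if code in codes)
        b = len(homozygotes) - a
        c = n_dx - a
        d = len(diagnoses) - len(homozygotes) - c
        p = fisher_enrichment_p(a, b, c, d)
        if p < alpha:
            hits.append((code, a, p))
    return hits


def permuted_hit_counts(homozygotes, diagnoses, n_perm=100, seed=0):
    """Number of passing diagnoses when homozygote labels are reassigned
    at random, giving a chance baseline for the observed hit count."""
    rng = random.Random(seed)
    subjects = sorted(diagnoses)
    k = len(homozygotes)
    return [len(screen_diagnoses(set(rng.sample(subjects, k)), diagnoses))
            for _ in range(n_perm)]


# Toy data (hypothetical): 23 subjects, 3 of them minor allele homozygotes.
toy_diagnoses = {f"s{i}": set() for i in range(23)}
toy_homozygotes = {"s0", "s1", "s2"}
for s in ("s0", "s1", "s3"):
    toy_diagnoses[s].add("thrombosis")
for s in ["s0", "s1"] + [f"s{i}" for i in range(3, 13)]:
    toy_diagnoses[s].add("hypertension")

toy_hits = screen_diagnoses(toy_homozygotes, toy_diagnoses)
# Only "thrombosis" passes (2 homozygotes, p of about 0.034);
# "hypertension" is too common among the other subjects to pass.
```

Comparing `len(toy_hits)` with the distribution returned by `permuted_hit_counts` mirrors the abstract's point: if permuted genotypes yield as many passing diagnoses as the real ones, the observed associations are consistent with chance.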

Highlights

Electronic medical record (EMR) systems store an increasing amount of clinical, laboratory and biometric data generated by health care systems.
The spectrum of disease entities collected in EMRs has enabled large-scale bioinformatics approaches such as Phenome-Wide Association Study (PheWAS), which searches in a disease-agnostic fashion for associations between common polymorphisms and hundreds of clinical diseases, identified using billing codes [8,9].
These data sources provide a rich resource for generating biologically-relevant clinical hypotheses based on observations of model organisms that can be tested in a real life setting using large EMRs coupled with DNA repositories, such as the Vanderbilt BioVU resource [16].

Summary

Introduction

Electronic medical record (EMR) systems store an increasing amount of clinical, laboratory and biometric data generated by health care systems. Of the 1,658 nsSNPs initially identified, 440 were located in genes with disease associations in the OMIM database and 555 were in genes represented in the KO mouse data set.
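
The counts above can be reconciled with the 711 nsSNPs mentioned in the abstract by inclusion-exclusion. The short sketch below assumes that 711 is the number of nsSNPs in either database (the union of the two sets); the overlap it derives is an inference from that assumption, not a figure stated in the text.

```python
# Counts quoted in the text.
n_total = 1658      # low frequency nsSNPs analyzed
n_omim = 440        # nsSNPs in genes with OMIM disease associations
n_ko_mouse = 555    # nsSNPs in genes in the KO mouse data set
n_either = 711      # assumption: nsSNPs in OMIM or KO mouse (union)

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
n_both = n_omim + n_ko_mouse - n_either

# Share of all analyzed nsSNPs retained by the restricted analysis.
fraction_retained = n_either / n_total

print(n_both)                       # 284
print(round(fraction_retained, 3))  # 0.429
```

Under this reading, 284 nsSNPs would fall in genes annotated in both resources, and the restricted analysis would keep roughly 43% of the original variant set.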

Results

Conclusion