Abstract

The database of Genotypes and Phenotypes (dbGaP) contains various types of data generated from genome-wide association studies (GWAS). These data can be used to facilitate novel scientific discoveries and to reduce cost and time for exploratory research. However, idiosyncrasies and inconsistencies in phenotype variable names are a major barrier to reusing these data. We addressed these challenges in standardizing phenotype variables by formalizing their descriptions using Clinical Element Models (CEM). Designed to represent clinical data, CEMs were highly expressive and thus were able to represent a majority (77.5%) of the 215 phenotype variable descriptions. However, their high expressivity also made it difficult to directly apply them to research data such as phenotype variables in dbGaP. Our study suggested that simplification of the template models makes it more straightforward to formally represent the key semantics of phenotype variables.

Highlights

  • With the advancements in genome-wide association studies (GWAS), the number of public genotypic and phenotypic data repositories, such as the database of Genotypes and Phenotypes, has significantly increased [1,2]

  • In Phase II, we modeled 200 original phenotype variables selected from two phenotype data dictionaries in the database of Genotypes and Phenotypes (dbGaP) using Clinical Element Model (CEM) template models

  • We investigated whether CEM template models could be applied to formalize the dbGaP variables

Summary

Introduction

With the advancements in genome-wide association studies (GWAS), the number of public genotypic and phenotypic data repositories, such as the database of Genotypes and Phenotypes (dbGaP), has significantly increased [1,2]. Phenotype variables are often named without a specific naming convention, or labeled with abbreviated codes that do not convey clear meaning. Many of these variables are accompanied by descriptions that help users understand what data the variable is intended to represent.

Challenges in standardizing phenotype variables in dbGaP

dbGaP contains various types of data generated in many GWAS, such as phenotypes, genotypes, and pedigree information of subjects, as well as specifics on samples, measurements and experiments. Challenges associated with non-standardized phenotype variables generated in different research institutions are widely recognized [6]. Standardization of the phenotype variables collected from different institutions/studies is a common challenge for these initiatives [7].
