Abstract

Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center’s Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013–2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and the observed-expected frequency ratio is informative of associations between clinical concepts and is useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
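
The two privacy protections named in the abstract, exclusion of rare concepts and Poisson randomization, can be illustrated with a minimal Python sketch. It is not the COHD implementation. It assumes that exclusion is applied to the true counts before perturbation and that each retained count is replaced by a draw from a Poisson distribution whose mean is the true count. The function names, the concept labels, and the chunked sampler are hypothetical and chosen only to keep the example dependency-free.

```python
import math
import random

MIN_COUNT = 10  # per the abstract, concepts with count <= 10 are excluded


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    The mean is split into chunks of at most 500 so that exp(-chunk) does not
    underflow; a sum of independent Poisson draws is Poisson with the summed mean.
    """
    total = 0
    remaining = lam
    while remaining > 0:
        chunk = min(remaining, 500.0)
        threshold = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        total += k
        remaining -= chunk
    return total


def protect_counts(true_counts, seed=None):
    """Drop rare concepts, then perturb each remaining count (assumed order)."""
    rng = random.Random(seed)
    released = {}
    for concept, count in true_counts.items():
        if count <= MIN_COUNT:
            continue  # rare concept: never released
        released[concept] = sample_poisson(float(count), rng)
    return released


if __name__ == "__main__":
    # Hypothetical concept labels and counts, for illustration only.
    toy_counts = {"condition_a": 1200, "drug_b": 45, "procedure_c": 7}
    print(protect_counts(toy_counts, seed=42))
```

In this sketch the released value stays close to the true count on average, while the exact count for any single concept is not disclosed and concepts at or below the threshold never appear in the output.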

Highlights

  • Sharing clinical data is important for reproducible biomedical research and can drive discovery

  • One such method is the sharing of prevalence and co-occurrence statistics of medical events

  • To accelerate translational biomedical research, we present Columbia Open Health Data (COHD), a database of electronic health record (EHR) prevalence and co-occurrence frequencies on conditions, drugs, procedures, and demographics observed per patient at Columbia University Irving Medical Center (CUIMC), covering 36,578 single concepts and 32,788,901 concept pairs from 5,364,781 patients

Background & Summary

Sharing clinical data is important for reproducible biomedical research and can drive discovery. The American Cancer Society releases annual reports of cancer statistics in the United States collected from cancer registries, including incidence, mortality, and survival from 46 anatomical cancer sites [11]. Although these reports can accurately estimate disease prevalence in the general population, this knowledge is difficult to consume at scale because it requires manual literature review. To accelerate translational biomedical research, we present Columbia Open Health Data (COHD), a database of EHR prevalence and co-occurrence frequencies on conditions, drugs, procedures, and demographics (sex, race, and ethnicity) observed per patient at Columbia University Irving Medical Center (CUIMC), covering 36,578 single concepts and 32,788,901 concept pairs from 5,364,781 patients. Institutions interested in joining the Observational Health Data Sciences and Informatics (OHDSI) Research Network will find an active and open community ready to help integrate new partners. These data are available for download from the Figshare repository (Data Citation 1).
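
To make the per-patient counting concrete, the following Python sketch derives single-concept and concept-pair counts from toy patient records and computes prevalence, relative frequency, and an observed-expected frequency ratio. It is an illustration under stated assumptions, not the COHD code: the expected pair count is taken to be the product of the two single-concept counts divided by the number of patients, the ratio is reported on a natural-log scale, and all function names and concept labels are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_concepts(patients):
    """Count each concept and each concept pair at most once per patient."""
    single, pair = Counter(), Counter()
    for concepts in patients:
        unique = sorted(set(concepts))  # sorting gives pairs a stable key order
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair


def prevalence(count, n_patients):
    """Fraction of patients in whom the concept (or pair) was observed."""
    return count / n_patients


def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the other concept."""
    return pair_count / base_count


def ln_observed_expected_ratio(pair_count, count_a, count_b, n_patients):
    """Natural log of observed over expected pair count (assumed formulation)."""
    expected = count_a * count_b / n_patients
    return math.log(pair_count / expected)


if __name__ == "__main__":
    # Toy records with hypothetical concept labels; one set per patient.
    patients = [
        {"diabetes", "metformin"},
        {"diabetes", "metformin"},
        {"diabetes"},
        {"hypertension"},
    ]
    single, pair = count_concepts(patients)
    n = len(patients)
    key = ("diabetes", "metformin")
    print("prevalence:", prevalence(single["diabetes"], n))
    print("relative frequency:", relative_frequency(pair[key], single["diabetes"]))
    print("ln ratio:", ln_observed_expected_ratio(
        pair[key], single["diabetes"], single["metformin"], n))
```

Under these assumptions, a log ratio above zero means the pair occurs together more often than independence would predict, and a value below zero means it occurs less often.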