Abstract

Background: Learning distributional representations of clinical concepts (e.g., diseases, drugs, and labs) is an important research area of deep learning in the medical domain. However, many existing methods do not consider temporal dependencies along the longitudinal sequence of a patient’s records, which may lead to incorrect selection of contexts.

Methods: To address this issue, we extended three popular concept embedding learning methods, word2vec, positive pointwise mutual information (PPMI), and FastText, to consider time-sensitive information. We then trained them on a large electronic health record (EHR) database containing about 50 million patients to generate concept embeddings, and evaluated them with both intrinsic evaluations, focusing on concept similarity measures, and an extrinsic evaluation assessing the use of the generated concept embeddings in predicting disease onset.

Results: Our experiments show that embeddings learned from information within a single visit (time window zero) improve performance on the concept similarity measure, and that FastText usually performed better than the other two algorithms. For the predictive modeling task, the best result was achieved by word2vec embeddings with a 30-day sliding window.

Conclusions: Time constraints are important when training clinical concept embeddings, and we expect the resulting embeddings to benefit a range of downstream applications.

Highlights

  • Learning distributional representations of clinical concepts is an important research area of deep learning in the medical domain

  • The algorithms include word2vec, positive pointwise mutual information (PPMI)-SVD [14], and FastText [15]

  • We conduct both intrinsic evaluations, focusing on concept similarity measures, and an extrinsic evaluation assessing the use of the generated concept embeddings in predicting disease onset


Summary

Methods

The EHR dataset. Cerner Health Facts® is a database comprising deidentified EHR data from over 600 participating Cerner client hospitals and clinics in the United States, representing over 50 million unique patients (1995–2015) (https://www.cerner.com/).

Most existing methods for learning word embeddings do not consider temporal dependencies between adjacent concepts in the modeling stage, which is crucial in the clinical domain and distinguishes it from natural language processing. These methods treat neighboring events (or visits) as adjacent words and assume that the events (or visits) within the sliding window reflect the scope of context for prediction (e.g., Med2vec [10]). We set the time window size to 0 to produce a visit-level embedding matrix, so that only clinical events within the same visit are considered as context.

For the PPMI-based method, we obtained concept embeddings by performing SVD on the PPMI matrix M, setting the time window for computing co-occurrences to either 0 (visit-level) or 30 days. We selected concepts with ICD codes as the evaluation set at the current stage.
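To make the time-windowed co-occurrence counting and the PPMI-SVD step concrete, here is a minimal sketch in Python/NumPy. This is not the authors’ released code: the event representation (concept ID plus day offset), the toy concept names, and the vocabulary handling are illustrative assumptions; only the ideas taken from the text are the window of 0 days (visit-level) versus 30 days and the SVD of the PPMI matrix M.

```python
from collections import Counter

import numpy as np


def cooccurrence_counts(patients, window_days):
    """Count concept pairs whose events fall within `window_days` of each
    other along a patient's timeline.

    `patients`: iterable of per-patient event lists; each event is a
    (concept_id, day_offset) tuple, sorted by day_offset.
    `window_days=0` restricts context to same-day (visit-level) events;
    `window_days=30` corresponds to the 30-day sliding window.
    """
    counts = Counter()
    for events in patients:
        for i, (c_i, t_i) in enumerate(events):
            for c_j, t_j in events[i + 1:]:
                if t_j - t_i > window_days:
                    break  # events are time-sorted: later ones are farther away
                counts[(c_i, c_j)] += 1
                counts[(c_j, c_i)] += 1
    return counts


def ppmi_svd_embeddings(counts, vocab, dim=100):
    """Build the PPMI matrix M from co-occurrence counts and factor it with
    a truncated SVD to obtain `dim`-dimensional concept embeddings."""
    idx = {c: k for k, c in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for (a, b), n in counts.items():
        C[idx[a], idx[b]] = n
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col))
    M = np.maximum(pmi, 0.0)   # keep only positive PMI values
    M[~np.isfinite(M)] = 0.0   # zero out entries from empty rows/columns
    U, S, _ = np.linalg.svd(M)
    return U[:, :dim] * np.sqrt(S[:dim])


# Toy usage: two patients, day offsets relative to each patient's first event.
patients = [
    [("ICD9_250.00", 0), ("LOINC_4548-4", 0), ("ICD9_401.9", 25)],
    [("ICD9_250.00", 0), ("ICD9_401.9", 10)],
]
vocab = sorted({c for p in patients for c, _ in p})
emb = ppmi_svd_embeddings(cooccurrence_counts(patients, window_days=30),
                          vocab, dim=2)
print(emb.shape)  # (3, 2)
```

In this form, `window_days=0` reproduces the visit-level co-occurrence described above, while `window_days=30` matches the sliding window that gave the best predictive-modeling result in the abstract.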

