Abstract

Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling.

Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size.

Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases.

Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.
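The per-tree sampling described in the Methods can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `plan_forest`, the `DATA_TYPES` labels, and the reading that each tree draws a single context window size shared by all four data types are assumptions made here. Tree training itself is left out; any decision tree learner could consume the plan.

```python
import random

# Hypothetical labels for the four clinical event types named in the abstract.
DATA_TYPES = ["diagnosis_codes", "drug_codes", "measurements", "note_words"]


def plan_forest(n_samples, n_trees, window_sizes, seed=0):
    """Return, for each tree, a bootstrap replicate and the semantic spaces to use.

    Assumption: one context window size is drawn per tree, and the semantic
    space of that window size is used for every data type in that tree.
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(n_trees):
        # Bootstrap replicate: draw n_samples row indices with replacement.
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Randomly select the context window size for this tree.
        window = rng.choice(window_sizes)
        # The whole feature set is represented: every data type gets a space.
        spaces = {dtype: (dtype, window) for dtype in DATA_TYPES}
        plan.append({"rows": rows, "window": window, "spaces": spaces})
    return plan


if __name__ == "__main__":
    # Ten semantic spaces per data type, indexed here by window sizes 1..10
    # (the actual window sizes used in the study are not given in this text).
    for tree in plan_forest(n_samples=8, n_trees=3, window_sizes=list(range(1, 11))):
        print(tree["window"], tree["rows"])
```

Under this reading, diversity between trees comes from both the bootstrap replicate and the choice of semantic spaces, while each individual tree still sees the complete feature set.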

Highlights

  • Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations

  • We have previously proposed a means of representing heterogeneous data types by first learning deep representations of clinical events based on their distribution in electronic health records (EHRs)

  • We investigate alternative ways of making use of semantic space ensembles in conjunction with the ensemble methods, bagging and random subspacing, used in the random forest learning algorithm


Summary

Introduction

Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The high dimensionality of the data, in turn, typically renders it extremely sparse, since patients, within a given care episode, are only exposed to a very small subset of the clinical events used for describing the training sample. This is known as the curse of dimensionality and makes it difficult to apply statistical methods to healthcare data. Structured EHR data includes diagnosis codes (in the form of, e.g., ICD), drug codes (in the form of, e.g., ATC) and measurements (typically in the form of institution-specific encoding). Using these data types inevitably gives rise to questions of representation, how to handle values missing at random or not, and how to take into account the temporality of clinical events. These issues have been addressed in a number of studies [2,3,4,5,6,7].

