Abstract

The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining, and of topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling the content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation (log-likelihood on held-out data and topic coherence of the produced topics) and qualitative assessment of the topics by physicians show that Red-LDA produces models superior to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
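As an illustration of baseline (iii), the following is a minimal sketch of paragraph-level redundancy removal within a single patient record. It is not the authors' released code: the function names are hypothetical, and it assumes that a paragraph counts as redundant only when it matches an earlier paragraph exactly after lowercasing and whitespace normalization. The redundancy detection used in the paper may well be more permissive than this.

```python
import re
from typing import List


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", paragraph.strip().lower())


def drop_redundant_paragraphs(notes: List[str]) -> List[str]:
    """Keep only the first occurrence of each paragraph in one patient's record.

    `notes` is assumed to hold the patient's notes in chronological order,
    with paragraphs separated by blank lines (an assumption of this sketch).
    """
    seen = set()
    reduced = []
    for note in notes:
        kept = []
        for para in re.split(r"\n\s*\n", note):
            key = normalize(para)
            if key and key not in seen:
                seen.add(key)
                kept.append(para.strip())
        # A note that was copied in full contributes nothing and is dropped.
        if kept:
            reduced.append("\n\n".join(kept))
    return reduced
```

The partial documents returned for each record would then be pooled across all records and passed to a standard LDA implementation.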

Highlights

  • The information contained in the electronic health record for a given patient record is quite redundant

  • We describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, redundancy-aware LDA (Red-LDA), which takes into account the inherent redundancy of clinical notes within a given patient record, and produces better topic models, as shown through quantitative and qualitative evaluation

  • To assess the value of handling redundancy explicitly as part of the topic modeling task of clinical notes, we compared redundancy-aware LDA (Red-LDA) to alternative methods using two established quantitative metrics for the evaluation of topic models (log-likelihood and topic coherence) and a qualitative review of the generated topics by clinical experts
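
To make the topic coherence metric mentioned above concrete, here is a minimal sketch of one common formulation, a UMass-style score based on document co-occurrence counts. This page does not say which coherence variant the paper uses, so the choice of formula, the function name, and the representation of documents as sets of words are all assumptions made for illustration.

```python
import math
from typing import List, Set


def umass_coherence(top_words: List[str], docs: List[Set[str]]) -> float:
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    D(...) counts the documents containing all of the given words.
    Scores closer to zero indicate words that tend to co-occur.
    """
    def doc_freq(*words: str) -> int:
        return sum(1 for d in docs if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the reference corpus
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / d_l)
    return score
```

In this formulation, a topic whose top words rarely appear in the same document receives a strongly negative score, while a topic of frequently co-occurring words scores near zero.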



Introduction

The information contained in the electronic health record for a given patient is quite redundant. We have shown through a quantitative analysis that redundancy hurts standard text-mining tools, such as collocation identification and topic modeling [3]. Topic modeling with Latent Dirichlet Allocation (LDA) [4] is a popular unsupervised method for discovering latent semantic properties of a document collection. Topic modeling has been shown to help in a large number of tasks, including document classification and clustering, multi-document summarization [5], search [6], document labeling [7,8], and information extraction [9]. LDA's sensitivity to different kinds of noise is not well understood, especially as various methods are used to evaluate the produced topic models [11,12].
