Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

Raphael Cohen, Noémie Elhadad, Michael Elhadad

doi:10.1186/1471-2105-14-10


Open Access

https://doi.org/10.1186/1471-2105-14-10


Abstract

Background: The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining?

Results: We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results.

Conclusions: Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
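To make the fingerprinting idea in the abstract more concrete, here is a minimal Python sketch of one way such a filter could work. It is not the authors' algorithm: the function names `fingerprints` and `drop_redundant_notes`, the 5-token window, the SHA-256 hash, and the 0.3 overlap threshold are all assumptions chosen for illustration.

```python
import hashlib


def fingerprints(text: str, n: int = 5) -> set[str]:
    """Hash every run of n consecutive lowercased tokens (hypothetical helper)."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        shingles = [" ".join(tokens)]
    else:
        shingles = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles}


def drop_redundant_notes(notes: list[str], n: int = 5, max_overlap: float = 0.3) -> list[str]:
    """Keep a note only if at most max_overlap of its fingerprints were already seen.

    Assumption: notes are given in the order they should be considered,
    and only fingerprints of kept notes are remembered.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for note in notes:
        fps = fingerprints(note, n)
        if not fps:
            continue  # skip empty notes
        overlap = len(fps & seen) / len(fps)
        if overlap <= max_overlap:
            kept.append(note)
            seen |= fps
    return kept
```

In this sketch a note that mostly repeats an earlier kept note is discarded, while a note with largely new text is retained. The threshold and window size would need tuning on real data.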

Highlights

The increasing availability of Electronic Health Record (EHR) data and free-text patient notes presents opportunities for phenotype extraction
Quantifying redundancy in a large-scale EHR corpus: word sequence redundancy at the patient level. The first task we address is to define metrics to measure the level of redundancy in a text corpus
We focus on the pre-processed EHR corpus, where named entities are mapped to UMLS Concept Unique Identifiers (CUIs) (Section 4.1.1 describes the automatic mapping method we used)
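As a rough illustration of what a word-sequence redundancy metric at the patient level might look like, the sketch below aligns the tokens of two notes and reports the share of the later note that also appears, in order, in the earlier one. This is a stand-in built on Python's standard `difflib.SequenceMatcher`, not the metric defined in the paper. The names `redundancy` and `patient_redundancy` are hypothetical.

```python
from difflib import SequenceMatcher


def redundancy(earlier: str, later: str) -> float:
    """Fraction of tokens in `later` that align with tokens in `earlier`."""
    a = earlier.lower().split()
    b = later.lower().split()
    if not b:
        return 0.0
    # autojunk=False keeps frequent tokens from being ignored in long notes
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(b)


def patient_redundancy(notes: list[str]) -> float:
    """Average pairwise redundancy over one patient's notes, taken in order."""
    pairs = [(notes[i], notes[j])
             for i in range(len(notes))
             for j in range(i + 1, len(notes))]
    if not pairs:
        return 0.0
    return sum(redundancy(x, y) for x, y in pairs) / len(pairs)
```

The same functions could be applied to sequences of concept identifiers instead of words, assuming each note has already been mapped to a space-separated string of CUIs.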

Summary

Introduction

The increasing availability of Electronic Health Record (EHR) data and free-text patient notes presents opportunities for phenotype extraction. Two promising areas of research in mining the EHR concern phenotype extraction, or more generally the modeling of disease based on clinical documentation [4,5,6], and drug-related discovery [7,8]. With these goals in mind, one might want to identify concepts that are associated by looking for frequently co-occurring pairs of concepts or phrases in patient notes, or cluster concepts across patients to identify latent variables corresponding to clinical models. Collocation discovery can help identify lexical variants of medical concepts that are specific to the genre of clinical notes and are not covered by existing terminologies. Topic modeling, another text-mining technique, can help cluster terms often mentioned in the same documents across many patients. This technique can bring us one step closer to identifying a set of terms representative of a particular condition, be it symptoms, drugs, comorbidities or even lexical variants of a given condition.
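The following Python sketch shows one common way to score candidate collocations, pointwise mutual information (PMI) over adjacent word pairs with a minimum-count filter. It is offered only as an illustration of how copied notes can affect such a method. The function name `bigram_pmi` and the `min_count` parameter are assumptions, and the paper may use a different collocation measure.

```python
import math
from collections import Counter


def bigram_pmi(docs: list[str], min_count: int = 1) -> dict[tuple[str, str], float]:
    """PMI (log base 2) for adjacent word pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to be considered a candidate
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

With a frequency filter such as `min_count=3`, a phrase written once and then copied into three notes passes the filter even though it reflects a single act of authorship. This is one way copy-and-paste redundancy can bias collocation candidates.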



Journal: BMC Bioinformatics
Publication Date: Jan 16, 2013
Citations: 139
License type: CC BY 2.0


