Abstract

Biomedical text clustering is a text mining technique that supports document search, browsing, and retrieval in biomedical and clinical text collections. This research explores a document representation based on concept embedding, combined with a proposed weighting scheme. The concept embedding is learned through neural networks to capture the associations between concepts, and the proposed weighting scheme uses these concept associations to build document vectors for clustering. We evaluate two types of concept embedding and the new weighting scheme for text clustering and visualization on two different biomedical text collections. The results demonstrate that concept embedding with the new weighting scheme performs better than the tf–idf baseline for clustering and visualization. Based on an internal clustering evaluation metric, the Davies–Bouldin index, and on the visualization, the concept embedding generated from aggregated word embedding forms well-separated clusters, whereas the intact concept embedding identifies more clusters of specific diseases and achieves a better F-measure.
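The Davies–Bouldin index mentioned above can be illustrated with a minimal sketch. It uses the standard definition of the metric (average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid distance) with Euclidean distance; it is not the authors' implementation. The function names `centroid` and `davies_bouldin` and the toy vectors are assumptions made for illustration only. Lower values indicate more compact, better separated clusters.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def davies_bouldin(vectors, labels):
    """Davies-Bouldin index (Euclidean); lower means better separated clusters.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = defaultdict(list)
    for vec, label in zip(vectors, labels):
        clusters[label].append(vec)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    centers = {k: centroid(pts) for k, pts in clusters.items()}
    # Scatter: mean distance from each cluster member to its own centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in pts) / len(pts)
        for k, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster j.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy 2-D "document vectors" (hypothetical values, not data from the paper).
docs = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
tight = davies_bouldin(docs, [0, 0, 1, 1])  # groups nearby points together
mixed = davies_bouldin(docs, [0, 1, 0, 1])  # groups distant points together
print(tight, mixed)
```

In this toy case the labeling that groups nearby points yields a much lower index than the labeling that mixes distant points, which is the sense in which the abstract uses the metric to judge how well separated the clusters are.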

Highlights

  • We propose and evaluate a framework for biomedical text clustering and visualization based on the concept embedding of diseases

  • The concept embedding is learned through neural networks

Introduction

Active research and practice in the medical domain have generated large volumes of text files, articles, and documents, including MEDLINE (the largest biomedical text database), clinical notes in Electronic Health Records (EHR), descriptions of clinical trials, and so on. Within biomedical and clinical text files, one medical concept might be represented in different forms or as abbreviations. For example, ‘Diabetes Mellitus Type 2’ could be represented as ‘DM2’ or ‘Type II Diabetes’ in different text files. This happens often in the clinical notes within the EHR, because clinicians have their own preferences for recording notes. Some medical concepts might also be highly correlated. For example, ‘Hypertension’ often co-occurs with ‘Stroke.’ The co-occurrences and semantic similarities between
