Abstract

Modern machine learning (ML) technologies hold great promise for automating diverse clinical and research workflows; however, training them requires extensive hand-labelled datasets. Disambiguating abbreviations is important for automated clinical note processing, but broad deployment of ML for this task is restricted by the scarcity and imbalance of labelled training data. In this work we present a method that improves a model's ability to generalize through novel data augmentation techniques that utilize information from biomedical ontologies, in the form of related medical concepts, as well as global context information within the medical note. We train our model on a public dataset (MIMIC-III) and test its performance on automatically generated and hand-labelled datasets from different sources (MIMIC-III, CASI, i2b2). Together, these techniques boost the accuracy of abbreviation disambiguation by up to 17% on hand-labelled data, without sacrificing performance on a held-out test set from MIMIC-III.

Highlights

  • Modern machine learning (ML) technologies have great promise for automating diverse clinical and research workflows; however, training them requires extensive hand-labelled datasets

  • A number of supervised machine learning (ML) models have been built for abbreviation disambiguation in medical notes, including ones based on support vector machines (SVM), Naive Bayes classifiers, and neural networks [3,4,5,6]

  • We map medical concepts in the Unified Medical Language System (UMLS) to the resulting vector space to generate a word embedding for every medical concept
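The highlight above describes mapping multi-word UMLS concepts into a word-vector space. One common way to do this, shown here as a minimal sketch that is not the paper's actual implementation, is to average the embeddings of a concept's constituent words; the tiny embedding table below is invented for illustration.

```python
import numpy as np

# Toy word-vector table (invented values; a real system would use
# embeddings trained on clinical text such as MIMIC-III notes).
word_vectors = {
    "rheumatoid": np.array([0.9, 0.1, 0.0]),
    "arthritis":  np.array([0.8, 0.2, 0.1]),
    "right":      np.array([0.1, 0.9, 0.3]),
    "atrium":     np.array([0.0, 0.8, 0.5]),
}

def concept_embedding(concept: str) -> np.ndarray:
    """Embed a multi-word concept by averaging its tokens' word vectors,
    skipping tokens that are not in the vocabulary."""
    vecs = [word_vectors[w] for w in concept.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

emb = concept_embedding("Rheumatoid Arthritis")  # average of the two token vectors
```

Averaging is a deliberately simple aggregation choice; it gives every concept a vector in the same space as individual words, which is what allows concept-level and word-level context to be compared directly.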


Introduction

Modern machine learning (ML) technologies have great promise for automating diverse clinical and research workflows; however, training them requires extensive hand-labelled datasets. While disambiguating abbreviations is typically simple for an expert in the field, it is a challenging task for automated processing, one that has been addressed by a number of methods going back at least 20 years. These methods largely rely on supervised algorithms, such as Naive Bayes classifiers trained on co-occurrence counts of senses with automatically tagged medical concepts in biomedical abstracts [1]. This approach was motivated by the idea that acronym expansions are related to the topic of the abstract and that topics can be described by the words with the highest TF-IDF weights. Creating hand-labelled medical abbreviation datasets to train and test ML models is costly and difficult, and to the best of our knowledge, the only such publicly available dataset with training data and labels is the Clinical Abbreviation Sense Inventory (CASI) [9], which contains just 75 abbreviations. The sparsity of these datasets makes methods built on them vulnerable to overfitting and inapplicable to abbreviations not present in the training data.
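To make the co-occurrence approach described above concrete, here is a minimal sketch, not the cited systems' actual code, of a Naive Bayes sense disambiguator trained on bag-of-words contexts with add-one smoothing; the abbreviation "RA" and its training sentences are invented for illustration.

```python
from collections import Counter, defaultdict
import math

# Invented training data: contexts labelled with the sense of "RA".
train = [
    ("rheumatoid arthritis", "joint pain and swelling treated with methotrexate"),
    ("rheumatoid arthritis", "chronic joint inflammation and morning stiffness"),
    ("right atrium", "catheter tip positioned in the heart chamber"),
    ("right atrium", "blood returns to the heart via the vena cava"),
]

def train_nb(examples):
    """Count sense frequencies and per-sense word frequencies."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in examples:
        sense_counts[sense] += 1
        for w in context.split():
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def predict(context, sense_counts, word_counts, vocab):
    """Pick the sense maximizing log P(sense) + sum of log P(word | sense),
    with add-one (Laplace) smoothing over the vocabulary."""
    total = sum(sense_counts.values())
    best, best_lp = None, float("-inf")
    for sense in sense_counts:
        lp = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context.split():
            lp += math.log((word_counts[sense][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

model = train_nb(train)
sense = predict("patient reports joint pain and stiffness", *model)
# → "rheumatoid arthritis"
```

The sketch also illustrates the overfitting risk the paragraph raises: with only a handful of labelled contexts per sense, any word absent from the training data contributes nothing beyond the smoothing term, so senses unseen in training can never be predicted.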

