Abstract
Pre-trained Large Language Models (LLMs) have revolutionised Natural Language Processing (NLP) tasks, but they often struggle when applied to specialised domains such as healthcare. The traditional approach of pre-training on large datasets followed by task-specific fine-tuning is resource-intensive and poorly aligned with the constraints of many healthcare settings. This presents a significant challenge for deploying LLM-based NLP solutions in medical contexts, where data privacy, computational resources, and domain-specific language pose unique obstacles.

This study aims to develop and evaluate efficient methods for adapting smaller LLMs to healthcare-specific datasets and tasks. We seek to identify pre-training approaches that can effectively instil healthcare competency in compact LLMs under tight computational budgets, a crucial capability for responsible and sustainable deployment in local healthcare settings.

We explore three specialised pre-training methods for adapting smaller LLMs to different healthcare datasets: traditional Masked Language Modelling (MLM), Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR), and a novel approach utilising metadata categories from healthcare settings. These methods are assessed across multiple healthcare datasets, with a focus on downstream document classification tasks. We evaluate the performance of the resulting LLMs through classification accuracy and analysis of the derived embedding spaces.

Contrastively trained models consistently outperform the other approaches on classification tasks, delivering strong performance with limited labelled data and fewer model parameter updates. While our novel metadata-based pre-training does not further improve classification across datasets, it yields interesting embedding cluster separability. Importantly, all domain-adapted LLMs outperform their publicly available, general-purpose base models, validating the importance of domain specialisation.

This research demonstrates the efficacy of specialised pre-training methods for adapting compact LLMs to healthcare tasks, even under resource constraints. We provide guidelines for pre-training specialised healthcare LLMs and motivate continued inquiry into contrastive objectives. Our findings underscore the potential of these approaches for aligning small LLMs with privacy-sensitive medical tasks, offering a path toward more efficient and responsible NLP deployment in healthcare settings. This work contributes to the broader goal of making advanced NLP techniques accessible and effective in specialised domains, particularly where resource limitations and data sensitivity are significant concerns.
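For readers unfamiliar with the contrastive objective underlying DeCLUTR, the sketch below illustrates the core idea in PyTorch: text spans sampled from the same document are embedded and pulled together, while all other in-batch spans act as negatives via an InfoNCE loss. This is an illustrative sketch rather than the paper's implementation; the `info_nce_loss` function name, tensor shapes, and temperature value are assumptions for demonstration, and the random tensors stand in for outputs of a compact encoder.

```python
# Minimal sketch (not the authors' code) of a DeCLUTR-style contrastive
# pre-training objective. Row i of `anchor_emb` and `positive_emb` are
# embeddings of two spans drawn from the same source document; every
# other row in the batch serves as a negative example.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor_emb: torch.Tensor,
                  positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss over a batch of (anchor, positive) span embeddings."""
    # Cosine similarity between every anchor and every candidate positive.
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature  # shape: (batch, batch)

    # The matching span (the diagonal) is the correct "class" per anchor.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Toy usage: random embeddings standing in for encoder outputs.
batch_size, hidden_dim = 8, 256
loss = info_nce_loss(torch.randn(batch_size, hidden_dim),
                     torch.randn(batch_size, hidden_dim))
print(f"contrastive loss: {loss.item():.4f}")
```

Minimising this loss encourages the encoder to map spans from the same document close together in embedding space, which is one plausible reason contrastive pre-training yields strong document-level classification performance with few labelled examples.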