Fine-tuning Large Language Models for Rare Disease Concept Normalization.

Andy Wang, Chunhua Weng, Cong Liu, Jingye Yang

doi:10.1101/2023.12.28.573586

Abstract

We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), on a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes the HPO names, half of each concept's synonyms, and the corresponding identifiers. We then fine-tuned Llama 2 (Llama2-7B) on each corpus and evaluated the resulting models using a range of sentence prompts and various phenotype terms. When the phenotype terms to be normalized were included in the fine-tuning corpora, both models performed nearly perfectly, averaging over 99% accuracy. In comparison, ChatGPT-3.5 achieved only ~20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced into the phenotype terms, the accuracy of the NAME and NAME+SYN models was 10.2% and 36.1%, respectively, but increased to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Our fine-tuned models demonstrate the ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and layman's terms. Our approach provides a solution for using LLMs to identify named medical entities in clinical narratives while successfully normalizing them to standard concepts in a controlled vocabulary.
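
The corpus construction and typo test described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' in-house script: the sentence templates, the function names (`build_corpora`, `add_single_char_typo`), the random choice of which half of the synonyms to include, and the toy concept entries are all assumptions made for illustration only.

```python
import random

# Hypothetical sentence templates; the abstract does not list the real ones.
TEMPLATES = [
    "The Human Phenotype Ontology identifier of {term} is {hpo_id}.",
    "{term} is normalized to {hpo_id}.",
]

# Toy entries for illustration; synonym lists are not guaranteed to match
# any particular HPO release.
TOY_CONCEPTS = {
    "HP:0001250": {"name": "Seizure", "synonyms": ["Seizures", "Epileptic seizure"]},
    "HP:0000252": {"name": "Microcephaly", "synonyms": ["Small head", "Reduced head circumference"]},
}


def build_corpora(concepts, seed=0):
    """Return (NAME sentences, NAME+SYN sentences, held-out synonym pairs)."""
    rng = random.Random(seed)
    name_corpus, name_syn_corpus, held_out = [], [], []
    for hpo_id, entry in concepts.items():
        # Both corpora contain the standardized name paired with its ID.
        name_sents = [t.format(term=entry["name"], hpo_id=hpo_id) for t in TEMPLATES]
        name_corpus.extend(name_sents)
        name_syn_corpus.extend(name_sents)
        # Assumption: the included half of the synonyms is chosen at random,
        # rounding down when the count is odd.
        synonyms = list(entry["synonyms"])
        rng.shuffle(synonyms)
        half = len(synonyms) // 2
        for syn in synonyms[:half]:
            name_syn_corpus.extend(t.format(term=syn, hpo_id=hpo_id) for t in TEMPLATES)
        # The remaining synonyms stay unseen and can serve as test terms.
        held_out.extend((syn, hpo_id) for syn in synonyms[half:])
    return name_corpus, name_syn_corpus, held_out


def add_single_char_typo(term, rng):
    """Replace one alphabetic character of `term` with a different lowercase letter."""
    positions = [i for i, c in enumerate(term) if c.isalpha()]
    i = rng.choice(positions)
    candidates = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != term[i].lower()]
    return term[:i] + rng.choice(candidates) + term[i + 1:]


if __name__ == "__main__":
    name, name_syn, unseen = build_corpora(TOY_CONCEPTS, seed=1)
    print(len(name), len(name_syn), len(unseen))
    print(add_single_char_typo("Microcephaly", random.Random(0)))
```

Under these assumptions, the held-out synonym pairs play the role of the "unseen synonyms" test set, and the typo function produces single-character substitutions; the paper may have used other typo types (insertions, deletions) that this sketch does not cover.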

Journal: bioRxiv : the preprint server for biology	Publication Date: Jun 13, 2024
Citations: 4	License type: cc-by

Similar Papers

How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
Galit Shmueli ... Bianca Maria Colosimo
INFORMS Journal on Data Science | VOL. 2
01 Apr 2023

Enabling action crossmodality for a pretrained large language model
Anton Caesar ... Stefan Wermter
Natural Language Processing Journal | VOL. 7
20 Apr 2024

Does one size fit all? Developing an evaluation strategy to assess large language models for patient safety event report analysis.
Allan Fong ... Raj M Ratwani
JAMIA open | VOL. 7
08 Oct 2024

Use of Generative AI to Identify Helmet Status Among Patients With Micromobility-Related Injuries From Unstructured Clinical Notes
Kathryn G Burford ... Andrew G Rundle
JAMA Network Open | VOL. 7
01 Aug 2024
