Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.

Zoltan P Majdik, S Scott Graham, Justin F Rousseau, Jade C Shiva Edward, Joshua B Barbour, Jared T Jensen, Martha S Karnes, Sabrina N Rodriguez

doi:10.2196/52095

Abstract

Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.

A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.

Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38.

Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
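
The two analyses named in the abstract (a two-predictor linear regression of F1-score on sentences and EPS, and a single-predictor threshold regression) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are synthetic and generated inside the script, the helper names `fit_ols` and `fit_threshold` are hypothetical, and the continuous two-segment ("hinge") form with a grid search over breakpoints is an assumed stand-in for whatever threshold model the study actually fitted.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot


def fit_threshold(x, y, candidates):
    """Grid-search one breakpoint t for y = b0 + b1*x + b2*max(x - t, 0).

    Returns (t, coefficients, sse) for the candidate with the lowest
    sum of squared errors. The hinge form is an assumption of this sketch.
    """
    best = None
    for t in candidates:
        A = np.column_stack([np.ones(len(y)), x, np.maximum(x - t, 0.0)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        sse = float(resid @ resid)
        if best is None or sse < best[2]:
            best = (float(t), beta, sse)
    return best


# Synthetic stand-in for the 2500 subsample results (NOT the study's data):
# F1 rises with sentence count up to a made-up plateau at 450 sentences.
rng = np.random.default_rng(0)
n = 500
sentences = rng.uniform(50, 1000, n)
eps = rng.uniform(0.8, 2.0, n)  # entities per sentence
f1 = (0.60 + 0.0006 * np.minimum(sentences, 450)
      + 0.03 * eps + rng.normal(0, 0.01, n))

# Two-predictor multiple linear regression: F1 ~ sentences + EPS.
beta, r2 = fit_ols(np.column_stack([sentences, eps]), f1)
print(f"intercept, sentences, EPS coefficients: {beta}")
print(f"multiple R^2: {r2:.4f}")

# Single-predictor threshold regression on sentence count.
t_hat, beta_t, sse = fit_threshold(sentences, f1, np.arange(100, 951, 10))
print(f"estimated breakpoint (sentences): {t_hat}")
```

A negative coefficient on the hinge term (`beta_t[2]`) means the slope flattens after the breakpoint, which is how a point of diminishing marginal return would show up in this form. The same `fit_threshold` call could be run with EPS as the single predictor.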


Journal: JMIR AI
Publication Date: May 16, 2024
Citations: 4
License type: CC BY

