Sample Size Of Dataset Research Articles

Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38. Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
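The idea of a threshold marking diminishing marginal returns can be illustrated with a minimal sketch. The abstract does not give the exact form of the threshold regression, so the "broken-stick" model below, y = a + b * min(x, t), is an assumption, as are the helper names `fit_line` and `fit_threshold`. The data are synthetic and noise-free, not taken from the study.

```python
# Hypothetical sketch of a single-predictor threshold regression.
# Assumed model: y = a + b * min(x, t), i.e. performance rises linearly with
# sample size up to a threshold t and is flat afterwards.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def fit_threshold(xs, ys, candidates):
    """Grid-search the threshold t that minimizes the sum of squared errors."""
    best = None
    for t in candidates:
        clipped = [min(x, t) for x in xs]  # predictor stops growing past t
        a, b, sse = fit_line(clipped, ys)
        if best is None or sse < best[3]:
            best = (t, a, b, sse)
    return best

# Synthetic illustration data (NOT from the study): F1 grows until 450
# sentences, then plateaus.
sizes = list(range(50, 1001, 50))
scores = [0.5 + 0.001 * min(s, 450) for s in sizes]

t, a, b, sse = fit_threshold(sizes, scores, candidates=sizes[1:-1])
print(f"estimated threshold: {t} sentences, slope before threshold: {b:.4f}")
```

On real, noisy measurements the grid of candidate thresholds would typically be finer, and a confidence interval for the threshold would be needed before reading the point estimate as a sample size recommendation.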

Background: Local policymakers require information about public health, housing and well-being for small geographical areas. A municipality can, for example, use this information to organize targeted activities with the aim of improving the well-being of its residents. Surveys are often used to gather data, but many neighborhoods have few or even zero respondents. In that case, estimating the status of the local population directly from survey responses is likely to be unreliable.

Methods: Small Area Estimation (SAE) is a technique to provide estimates at small geographical levels with few or even zero respondents. In classical individual-level SAE, a complex statistical regression model is fitted to the survey responses by using auxiliary administrative data for the population as predictors; the missing responses are then predicted and aggregated to the desired geographical level. In this paper we compare gradient boosted trees (XGBoost), a well-known machine learning technique, to a structured additive regression model (STAR) designed for the specific problem of estimating public health and well-being in the whole population of the Netherlands.

Results: We compare the accuracy and performance of these models using out-of-sample predictions with five-fold Cross Validation (5CV). We do this for three data sets of different sample sizes and outcome types. Compared to the STAR model, gradient boosted trees are able to improve both the accuracy of the predictions and the total time taken to get these predictions. Even though the models appear quite similar in overall accuracy, the small area predictions at neighborhood level sometimes differ significantly. It may therefore make sense to pursue slightly more accurate models for better predictions into small areas. However, one of the biggest benefits is that XGBoost does not require prior knowledge or model specification. Data preparation and modelling are much easier, since the method automatically handles missing data, non-linear responses and interactions, and accounts for spatial correlation structures.

Conclusions: In this paper we provide new nationwide estimates of health, housing and well-being indicators at neighborhood level in the Netherlands (see 'Online materials'). We demonstrate that machine learning provides a good alternative to complex statistical regression modelling for small area estimation in terms of accuracy, robustness, speed and data preparation. These results can be used to make appropriate policy decisions at a local level and make recommendations about which estimation methods are beneficial in terms of accuracy, time and budget constraints.
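The five-fold cross-validation comparison described in the abstract can be sketched in a few lines. This is only an illustration of the evaluation protocol: the two models below (a global mean and a one-predictor linear fit) are stand-ins, not the STAR or XGBoost models of the paper, and the function names and the synthetic data are assumptions made for the example.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_rmse(xs, ys, fit, k=5, seed=0):
    """Out-of-sample RMSE: every fold is predicted by a model fit on the rest."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return (sq_err / len(xs)) ** 0.5

# Stand-in model 1: always predict the training mean.
def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

# Stand-in model 2: ordinary least squares with one predictor.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

# Synthetic data for illustration only (not survey data).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 0.1) for x in xs]

rmse_mean = cross_val_rmse(xs, ys, fit_mean)
rmse_linear = cross_val_rmse(xs, ys, fit_linear)
print(f"5CV RMSE, mean model: {rmse_mean:.3f}; linear model: {rmse_linear:.3f}")
```

Both models are scored on the same folds (same seed), so the comparison reflects the models rather than the split. Any model that can be wrapped as a `fit` function returning a predictor could be dropped into the same loop.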

Articles published on Sample Size Of Dataset

Exploring K-Means Clustering Efficiency: Accuracy and Computational Time across Multiple Datasets

Calibrating and Visualizing Some Bootstrap Confidence Regions

TabDEG: Classifying differentially expressed genes from RNA-seq data based on feature extraction and deep learning framework.

Harmonizing heterogeneous transcriptomics datasets for machine learning-based analysis to identify spaceflown murine liver-specific changes

Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.

ADCP suspended sediment transport monitoring using acoustic particle radius

Sparse L0-norm least squares support vector machine with feature selection

ADscreen: A speech processing-based screening system for automatic identification of patients with Alzheimer's disease and related dementia

Research on Fire Detection in Laboratories Based on CNN and Transfer Learning

Selection and validation of novel stable reference genes for qPCR analysis in EMT and MET

Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data.

Predicting Human Bioavailability of Subcutaneously Administered Fusion Proteins and Monoclonal Antibodies Using Human Intravenous Clearance or Antibody Isoelectric Point

Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset.

Real-time classification for Φ-OTDR vibration events in the case of small sample size datasets

Data-driven performance analysis of a residential building applying artificial neural network (ANN) and multi-objective genetic algorithm (GA)

A unified model for the sparse optimal scoring problem

Interpreting models interpreting brain dynamics

Fault Classification of Rolling Element Bearing in Machine Learning Domain

Investigating Performance of Composite Quantile Regression with and without Penalization

A machine learning approach to small area estimation: predicting the health, housing and well-being of the population of Netherlands
