Size Selection Techniques Research Articles

Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184).
Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38. Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
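The two-predictor regression described in this abstract can be illustrated with a minimal sketch. It assumes each training subsample is summarized by its sentence count, its entity density (EPS), and the F1-score of the model fine-tuned on it. The function names and the toy numbers are hypothetical and are not taken from the study.

```python
import numpy as np

def entities_per_sentence(entity_counts):
    """Entity density (EPS): total annotated entities divided by the number of sentences."""
    counts = np.asarray(entity_counts, dtype=float)
    return counts.sum() / counts.size

def fit_two_predictor_ols(n_sentences, eps, f1):
    """Ordinary least squares fit of F1 ~ intercept + n_sentences + EPS.

    Returns (coefficients, r_squared), with coefficients ordered as
    [intercept, slope_sentences, slope_eps].
    """
    x = np.column_stack([np.ones(len(f1)), n_sentences, eps])
    y = np.asarray(f1, dtype=float)
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    residuals = y - x @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot

# Toy, made-up values purely to exercise the functions (not study data).
toy_sentences = np.array([100, 200, 300, 400, 500, 600])
toy_eps = np.array([1.0, 1.4, 1.1, 1.5, 1.2, 1.3])
toy_f1 = 0.5 + 0.0005 * toy_sentences + 0.1 * toy_eps  # exact linear relation
coef, r2 = fit_two_predictor_ols(toy_sentences, toy_eps, toy_f1)
```

Because the toy F1 values are generated from an exact linear relation, the fit simply recovers the generating coefficients. The single-predictor threshold regression used to look for diminishing returns would need a dedicated fitting procedure and is not shown here.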

RNA post-transcriptional modifications play a crucial role in a myriad of biological processes and cellular functions. To date, more than 160 RNA modifications have been discovered; therefore, accurate identification of RNA-modification sites is fundamental for a better understanding of RNA-mediated biological functions and mechanisms. However, due to limitations in experimental methods, systematic identification of different types of RNA-modification sites remains a major challenge. Recently, more than 20 computational methods have been developed to identify RNA-modification sites in tandem with high-throughput experimental methods, with most of these capable of predicting only single types of RNA-modification sites. These methods show high diversity in their dataset size, data quality, core algorithms, extracted features, feature selection techniques, and evaluation strategies. Therefore, there is an urgent need to revisit these methods and summarize their methodologies, in order to improve and further develop computational techniques to identify and characterize RNA-modification sites from the large amounts of sequence data. With this goal in mind, first, we provide a comprehensive survey on a large collection of 27 state-of-the-art approaches for predicting N1-methyladenosine and N6-methyladenosine sites. We cover a variety of important aspects that are crucial for the development of successful predictors, including the dataset quality, operating algorithms, sequence and genomic features, feature selection, model performance evaluation and software utility. In addition, we also provide our thoughts on potential strategies to improve the model performance. Second, we propose a computational approach called DeepPromise based on deep learning techniques for simultaneous prediction of N1-methyladenosine and N6-methyladenosine.
To extract the sequence context surrounding the modification sites, three feature encodings (enhanced nucleic acid composition, one-hot encoding, and RNA embedding) were each used as the input to seven consecutive layers of convolutional neural networks (CNNs). DeepPromise then combined the prediction scores of the CNN-based models and achieved around 43% higher area under the receiver-operating characteristic curve (AUROC) for m1A site prediction and 2-6% higher AUROC for m6A site prediction when compared with several existing state-of-the-art approaches on the independent test. In-depth analyses of characteristic sequence motifs identified from the convolution-layer filters indicated that nucleotide presentation at proximal positions surrounding the modification sites contributed most to the classification, whereas those at distal positions also affected classification but to different extents. To maximize user convenience, a web server was developed as an implementation of DeepPromise and made publicly available at http://DeepPromise.erc.monash.edu/, with the server accepting both RNA sequences and genomic sequences to allow prediction of two types of putative RNA-modification sites.
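Of the three feature encodings named in this abstract, one-hot encoding is the simplest to illustrate. The sketch below is not DeepPromise's code: the function name, the A/C/G/U column order, the mapping of T to U for genomic input, and the all-zero row for unknown symbols are all assumptions made for illustration.

```python
import numpy as np

ALPHABET = "ACGU"  # assumed nucleotide order; the abstract does not specify one

def one_hot_encode(sequence):
    """Return a (len(sequence), 4) matrix; unknown symbols (e.g. 'N') become all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    # Assumption: genomic (DNA) input is handled by treating T as U.
    for pos, base in enumerate(sequence.upper().replace("T", "U")):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

# Encode a short hypothetical window of sequence context around a candidate site.
window = one_hot_encode("GGACU")
```

A matrix of this shape, with one row per position and one column per nucleotide, is the usual form of input for one-dimensional convolutional layers.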


Articles published on Size Selection Techniques

Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.

Sonication-assisted liquid exfoliation and size-dependent properties of magnetic two-dimensional α-RuCl3

A new adaptive window‐based guided filtering and interpolation for polarization image demosaicing

Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs

Does a Change in Device Design Alter Device Size Selection? A Comparison of Conventional and Occlutech Duct Occluder Designs.

Deriving Right Sample Size and Choosing an Appropriate Sampling Technique to Select Samples from the Research Population During Ph.D. Program in India

Size Selection and Size‐Dependent Optoelectronic and Electrochemical Properties of 2D Titanium Carbide (Ti3C2Tx) MXene

Problems of the Grid Size Selection in Differential Box-Counting (DBC) Methods and an Improvement Strategy.

Medicolegal Sidebar: Are Implant Sales Reps in the Operating Room Legally Untouchable?

A novel low-cost method for generalized split inverse problem of finite family of demimetric mappings

Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences.

On the Barzilai‐Borwein basic scheme in FFT‐based computational homogenization

Tuning iteration space slicing based tiled multi-core code implementing Nussinov’s RNA folding

(Invited) Liquid-Exfoliated Transition Metal Dichalcogenides: A Story of Excitons, Spectroscopic Metrics and Functionalisation

Habitat type and ambient temperature contribute to bill morphology

Measuring the lateral size of liquid-exfoliated nanosheets with dynamic light scattering

A combined size sorting strategy for monodisperse plasmonic nanostructures

Isolation of Rat Portal Fibroblasts by In situ Liver Perfusion


Identification by cDNA Cloning of Abundant sRNAs in a Human-Avirulent Yersinia pestis Strain Grown Under Five Different Growth Conditions

