Quality Of Training Data Research Articles

Abstract: A vast amount of clinical data are still stored in unstructured text. Automatic extraction of medical information from these data poses several challenges: high costs of clinical expertise, restricted computational resources, strict privacy regulations, and limited interpretability of model predictions. Recent domain adaptation and prompting methods using lightweight masked language models showed promising results with minimal training data and allow for application of well-established interpretability methods. We are the first to present a systematic evaluation of advanced domain-adaptation and prompting methods in a lower-resource medical domain task, performing multi-class section classification on German doctor’s letters. We evaluate a variety of models, model sizes, (further-pre)training and task settings, and conduct extensive class-wise evaluations supported by Shapley values to validate the quality of small-scale training data and to ensure interpretability of model predictions. We show that in few-shot learning scenarios, a lightweight, domain-adapted pretrained language model, prompted with just 20 shots per section class, outperforms a traditional classification model, increasing accuracy from $48.6\%$ to $79.1\%$. By using Shapley values for model selection and training data optimization, we could further increase accuracy up to $84.3\%$. Our analyses reveal that pretraining of masked language models on general-language data is important to support successful domain transfer to medical language, so that further-pretraining of general-language models on domain-specific documents can outperform models pretrained on domain-specific data only. Our evaluations show that applying prompting based on general-language pretrained masked language models combined with further-pretraining on medical-domain data achieves significant improvements in accuracy beyond traditional models with minimal training data. Further performance improvements and interpretability of results can be achieved using interpretability methods such as Shapley values. Our findings highlight the feasibility of deploying powerful machine learning methods in clinical settings and can serve as a process-oriented guideline for lower-resource languages and domains such as clinical information extraction projects.

In recent years, rapid technological advancements have propelled blockchain and artificial intelligence (AI) into prominent roles within the digital industry, each having unique applications. Blockchain, recognized for its secure and transparent data storage, and AI, a powerful tool for data analysis and decision making, exhibit common features that render them complementary. At the same time, machine learning has become a robust and influential technology, adopted by many companies to address non-trivial technical problems. This adoption is fueled by the vast amounts of data generated and utilized in daily operations. An intriguing intersection of blockchain and AI occurs in the realm of federated learning, a distributed approach allowing multiple parties to collaboratively train a shared model without centralizing data. This paper presents a decentralized platform, FBLearn, for the implementation of federated learning on blockchain, which makes it possible to harness the benefits of federated learning without exchanging sensitive customer or product data, thereby fostering trustless collaboration. Because the decentralized blockchain network replaces the centralized server in distributed model training, global model aggregation approaches have to be utilized. This paper investigates several techniques for model aggregation based on the local model average and ensemble, using either local or globally distributed validation data for model evaluation. The suggested aggregation approaches are experimentally evaluated based on two use cases of the FBLearn platform: credit risk scoring using a random forest classifier and credit card fraud detection using logistic regression. The experimental results confirm that the suggested adaptive weight calculation and ensemble techniques based on the quality of local training data enhance the robustness of the global model. The performance evaluation metrics and ROC curves prove that the aggregation strategies successfully isolate the influence of the low-quality models on the final model. The proposed system’s ability to outperform models created with separate datasets underscores its potential to enhance collaborative efforts and to improve the accuracy of the final global model compared to each of the local models. Integrating blockchain and federated learning presents a forward-looking approach to data collaboration while addressing privacy concerns.
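
To make the idea of quality-based aggregation concrete, the following is a minimal sketch, not the FBLearn implementation. It assumes that each participant reports a validation score for its local model and that the global model is a weighted average of local parameter vectors (as one might do for logistic regression coefficients). The function names, the `min_score` cut-off, and the example numbers are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def quality_weights(scores: Sequence[float], min_score: float = 0.5) -> List[float]:
    """Turn per-participant validation scores into normalized aggregation weights.

    Scores at or below `min_score` (a hypothetical cut-off, not from the paper)
    get weight 0, so a low-quality local model cannot influence the global model.
    """
    raw = [max(s - min_score, 0.0) for s in scores]
    total = sum(raw)
    if total == 0.0:
        # No local model beats the cut-off: fall back to a plain average.
        return [1.0 / len(scores)] * len(scores)
    return [r / total for r in raw]


def aggregate(local_params: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[float]:
    """Weighted average of local parameter vectors of equal length."""
    n_params = len(local_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, local_params))
        for i in range(n_params)
    ]


# Three hypothetical participants; the third was trained on low-quality data
# and scores no better than chance on the validation set.
local_params = [[1.0, 2.0], [3.0, 4.0], [-50.0, 80.0]]
validation_scores = [0.75, 0.75, 0.5]

weights = quality_weights(validation_scores)
global_params = aggregate(local_params, weights)

print(weights)        # [0.5, 0.5, 0.0]
print(global_params)  # [2.0, 3.0]
```

In this toy run the third participant receives zero weight, so its outlying parameters do not shift the aggregated model. Parameter averaging of this kind does not carry over directly to a random forest; for that use case the abstract mentions ensemble techniques instead.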

Articles published on Quality Of Training Data

Clinical information extraction for lower-resource languages and domains with few-shot learning using pretrained language models and prompting

Developing Generalizable Scoring Functions for Molecular Docking: Challenges and Perspectives.

Leveraging Unsupervised Task Adaptation and Semi‐Supervised Learning With Semantic‐Enriched Representations for Online Sexism Detection

Tennis teaching assistance model based on double chain shared unsupervised action recognition algorithm

Batik Pattern Classification Using Machine Learning Approaches

Water Resources’ AI–ML Data Uncertainty Risk and Mitigation Using Data Assimilation

Machine Learning for Advanced Emission Monitoring and Reduction Strategies in Fossil Fuel Power Plants

Latent Diffusion Models to Enhance the Performance of Visual Defect Segmentation Networks in Steel Surface Inspection.

Robust optical picometrology through data diversity

Application of Artificial Intelligence in Ophthalmology: An Updated Comprehensive Review.

FBLearn: Decentralized Platform for Federated Learning on Blockchain

Computer-Simulated Virtual Image Datasets to Train Machine Learning Models for Non-Invasive Fish Detection in Recirculating Aquaculture.

A generative adversarial learning strategy for spatial inspection of compaction quality

Accuracy of machine learning in predicting outcomes post-percutaneous coronary intervention: a systematic review.

Deep learning the Hurst parameter of linear fractional processes and assessing its reliability

Finding core labels for maximizing generalization of graph neural networks

Fault Diagnosis Method for Elevator Carriages Based on Temporal Generative Federated Distillation

Leak Event Diagnosis for Power Plants: Generative Anomaly Detection Using Prototypical Networks.

Analysis of Vina Film Sentiment on Social Media X Using The Naïve Bayes Method

GeDa: Improving training data with large language models for Aspect Sentiment Triplet Extraction
