Small Training Set Research Articles

Recent advances in digital technologies have lowered the cost and improved the quality of digital pathology Whole Slide Images (WSIs), opening the door to applying Machine Learning (ML) techniques to assist in cancer diagnosis. ML, including Deep Learning (DL), has produced impressive results in diverse image classification tasks in pathology, such as predicting clinical outcomes in lung cancer and inferring regional gene expression signatures. Despite these promising results, the uptake of ML as a common diagnostic tool in pathology remains limited. A major obstacle is the lack of sufficient labelled data for training neural networks and other classifiers, especially at new sites where models have not yet been established. Recently, image synthesis from small labelled datasets using Generative Adversarial Networks (GANs) has been used successfully to create high-performing classification models. Considering the domain shift and the complexity of annotating data, we investigated a GAN-based approach that minimizes the differences in WSIs between large public data archives and the much smaller data archives at new sites. The proposed approach allows a deep learning classification model for the class of interest to be tuned using the small training set available at a new site. This paper uses a GAN with the one-class classification concept to model the class-of-interest data, which minimizes the need for large amounts of labelled data from the new site to train the network. The GAN generates synthesized one-class WSI images that are used to train the classifier jointly with the WSIs available from the new site. We tested the proposed approach on follicular lymphoma data from a new site, using the data archives from different sites.
Synthetic one-class images, generated from data obtained at different sites together with a minimal amount of data from the new site, produced a significant improvement of 15% in the Area Under the Curve (AUC) for the new site at which we wanted to establish a follicular lymphoma classifier. The test results show that the classifier can perform well without the need to obtain more training data from the test site, by using a GAN to generate synthetic data from the existing data in the archives of all the sites.
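The joint training idea described in the abstract above (a small set of real new-site images combined with GAN-generated images) can be illustrated with a minimal data-mixing sketch. This is not the authors' pipeline: the function name `mixed_batches`, the `real_fraction` parameter, and the use of tuples as stand-ins for image tiles are all assumptions made for illustration, and no GAN is trained here.

```python
import random


def mixed_batches(real, synthetic, batch_size, real_fraction, seed=0):
    """Yield batches mixing scarce real items with synthetic items.

    Hypothetical helper: `real` stands in for the few labelled new-site
    tiles, `synthetic` for GAN-generated tiles (assumed already produced).
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rng = random.Random(seed)
    # Keep at least one real and one synthetic item in every batch.
    n_real = min(batch_size - 1, max(1, round(batch_size * real_fraction)))
    n_syn = batch_size - n_real
    real_pool = list(real)
    syn_pool = list(synthetic)
    rng.shuffle(syn_pool)
    # Walk through the synthetic pool once; real items are sampled with
    # replacement because the real set is assumed to be very small.
    for start in range(0, len(syn_pool) - n_syn + 1, n_syn):
        batch = syn_pool[start:start + n_syn] + rng.choices(real_pool, k=n_real)
        rng.shuffle(batch)
        yield batch


# Toy usage: 5 "real" tiles and 24 "synthetic" tiles, 8 items per batch.
real_tiles = [("real", i) for i in range(5)]
synthetic_tiles = [("synthetic", i) for i in range(24)]
batches = list(mixed_batches(real_tiles, synthetic_tiles, batch_size=8, real_fraction=0.25))
```

Sampling the real items with replacement is one possible design choice for a very small real set; whether and how the original study balanced real and synthetic images is not stated in the abstract.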

Fetal alcohol syndrome (FAS) is a lifelong developmental disability that occurs among individuals with prenatal alcohol exposure (PAE). With improved prediction models, FAS can be diagnosed or treated early, if not completely prevented. In this study, we sought to compare different machine learning algorithms and their FAS predictive performance among women who consumed alcohol during pregnancy. We also aimed to identify which variables (eg, timing of exposure to alcohol during pregnancy and type of alcohol consumed) were most influential in generating an accurate model. Data from the Collaborative Initiative on Fetal Alcohol Spectrum Disorders from 2007 to 2017 were used to gather information about 595 women who consumed alcohol during pregnancy at 5 hospital sites around the United States. Information about PAE was obtained through questionnaires or in-person interviews, as well as reviews of medical, legal, or social service records. Four different machine learning algorithms (logistic regression, XGBoost, light gradient-boosting machine, and CatBoost) were trained to predict FAS at birth, and model performance was measured by the area under the receiver operating characteristic curve (AUROC). Of the total cases, 80% were randomly selected for training, while the remaining 20% were held out as the test data set for predicting FAS. Feature importance was also analyzed using Shapley values for the best-performing algorithm. Overall, there were 20 cases of FAS within a total population of 595 individuals with PAE. Most of the drinking occurred in the first trimester only (n=491) or throughout all 3 trimesters (n=95); however, there were also reports of drinking in the first and second trimesters only (n=8) and in the third trimester only (n=1).
The CatBoost method delivered the highest AUROC (0.92), with an area under the precision-recall curve (AUPRC) of 0.51, followed by the logistic regression method (AUROC 0.90; AUPRC 0.59), the light gradient-boosting machine (AUROC 0.89; AUPRC 0.52), and XGBoost (AUROC 0.86; AUPRC 0.45). Shapley values in the CatBoost model revealed that 12 variables were considered important in FAS prediction, with drinking throughout all 3 trimesters of pregnancy, maternal age, race, and type of alcoholic beverage consumed (eg, beer, wine, or liquor) scoring highly in overall feature importance. For most predictive measures, the best performance was obtained by the CatBoost algorithm, with an AUROC of 0.92, precision of 0.50, specificity of 0.29, F1 score of 0.29, and accuracy of 0.96. Machine learning algorithms were able to identify FAS risk among pregnant drinkers with a prediction performance higher than that of previous models. For small training sets, which are common with FAS, boosting mechanisms like CatBoost may help alleviate certain problems associated with data imbalances and difficulties in optimization or generalization.
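The two ranking metrics reported in the abstract above, AUROC and AUPRC, can be sketched in a few lines of plain Python. This is a minimal illustration on made-up toy scores, not the study's data or evaluation code; the function names `auroc` and `average_precision` are hypothetical, and average precision is used here as one common estimator of AUPRC. The last line computes the positive prevalence implied by the abstract's counts (20 FAS cases among 595 individuals), which is roughly the AUPRC a random ranking would be expected to reach.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive.

    Tied scores are ordered arbitrarily by the sort; this sketch does not
    treat them specially.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Toy, made-up example: 2 positives among 10 cases.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
toy_scores = [0.10, 0.20, 0.90, 0.30, 0.80, 0.05, 0.40, 0.15, 0.25, 0.35]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)

# Prevalence implied by the counts in the abstract (20 of 595).
fas_prevalence = 20 / 595
```

With so few positive cases, AUPRC values should be read against this low prevalence baseline rather than against 0.5, which is the chance level for AUROC.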

Articles published on Small Training Set

Convolutional neural network for classifying cartoon images augmented by DCGAN

Fine-grained Species Recognition with Privileged Pooling: Better Sample Efficiency Through Supervised Attention.

Sample-Efficient Cardinality Estimation Using Geometric Deep Learning

Analog RF Circuit Sizing by a Cascade of Shallow Neural Networks

Efficient Exploration of Chemical Compound Space Using Active Learning for Prediction of Thermodynamic Properties of Alkane Molecules.

MahaEmoSen: Towards Emotion-aware Multimodal Marathi Sentiment Analysis

A prediction model for blood-brain barrier penetrating peptides based on masked peptide transformers with dynamic routing.

Block sparsity promoting algorithm for efficient construction of cluster expansion models for multicomponent alloys

Towards detection of cancer biomarkers in human exhaled air by transfer-learning-powered analysis of odor-evoked calcium activity in rat olfactory bulb

Accelerating the characterization of dynamic DNA origami devices with deep neural networks

Oracle-based data generation for highly efficient digital twin network training

A supervised approach for the detection of AM-FM signals’ interference regions in spectrogram images

A new computationally efficient method to tune BERT networks – transfer learning

Improving short text classification with augmented data using GPT-3

Conjugated quantitative structure-property relationship models: Prediction of kinetic characteristics linked by the Arrhenius equation.

Representative Data Selection for Efficient Medical Incremental Learning.

The use of generative adversarial networks for multi-site one-class follicular lymphoma classification

Application of a wavelength angle mapper for variable selection in iterative optimization technology predictions of drug content in pharmaceutical powder mixtures

Predicting Fetal Alcohol Spectrum Disorders Using Machine Learning Techniques: Multisite Retrospective Cohort Study.

CellSighter: a neural network to classify cells in highly multiplexed images
