Predicting equine behavior from small datasets using machine learning with LLM-generated explanations
- Research Article
6
- 10.1117/1.jbo.28.3.036501
- Mar 14, 2023
- Journal of biomedical optics
Machine learning (ML) models based on deep convolutional neural networks have been used to significantly increase microscopy resolution, speed [signal-to-noise ratio (SNR)], and data interpretation. The bottleneck in developing effective ML systems is often the need to acquire large datasets to train the neural network. We demonstrate how adding a "dense encoder-decoder" (DenseED) block can be used to effectively train a neural network that produces super-resolution (SR) images from conventional microscopy diffraction-limited (DL) images using a small training dataset [15 fields of view (FOVs)]. ML can help retrieve SR information from a DL image when trained with a massive training dataset. The aim of this work is to demonstrate a neural network that estimates SR images from DL images using modifications that enable training with a small dataset. We employ DenseED blocks in existing SR ML network architectures. DenseED blocks use a dense layer that concatenates features from the previous convolutional layer to the next convolutional layer. DenseED blocks in fully convolutional networks (FCNs) estimate the SR images when trained with a small training dataset (15 FOVs) of human cells from the Widefield2SIM dataset and in fluorescent-labeled fixed bovine pulmonary artery endothelial cell samples. Conventional ML models without DenseED blocks trained on small datasets fail to accurately estimate SR images, while models including the DenseED blocks can. Networks containing DenseED blocks also improve the average peak SNR (PSNR) and resolution. We evaluated various configurations of target image generation methods (e.g., experimentally captured targets and computationally generated targets) used to train FCNs with and without DenseED blocks, and showed that simple FCNs with DenseED blocks outperform simple FCNs without them.
DenseED blocks in neural networks show accurate extraction of SR images even if the ML model is trained with a small training dataset of 15 FOVs. This approach shows that microscopy applications can use DenseED blocks to train on smaller datasets that are application-specific imaging platforms and there is promise for applying this to other imaging modalities, such as MRI/x-ray, etc.
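The dense-connectivity idea described above (each layer receiving the concatenated feature maps of all earlier layers) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: `conv_layer` is a hypothetical stand-in for a real convolution, using a random channel-mixing linear map plus ReLU.

```python
import numpy as np

def conv_layer(x, out_channels, rng):
    # Stand-in for a convolutional layer: a random linear map over the
    # channel axis followed by ReLU (illustrative only, not a real conv)
    w = rng.standard_normal((x.shape[0], out_channels))
    return np.maximum(x.T @ w, 0).T

def dense_block(x, growth, n_layers, rng):
    # Dense connectivity: every layer sees the concatenation of the block
    # input and all earlier layers' feature maps along the channel axis
    features = [x]
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=0)
        features.append(conv_layer(inp, growth, rng))
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))        # 8 channels, 64 "pixels"
out = dense_block(x, growth=4, n_layers=3, rng=rng)
```

Stacking three layers with a growth rate of 4 on an 8-channel input yields 8 + 3 × 4 = 20 output channels, showing how feature maps accumulate through the block.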
- Research Article
15
- 10.1109/tse.2021.3135465
- Jan 1, 2022
- IEEE Transactions on Software Engineering
This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software Engineering there exist many small (<5,000 samples) and medium-sized (<100,000 samples) datasets. While deep learning has set the state of the art in many machine learning tasks, it is only recently that it has proven effective on small-sized datasets, primarily thanks to pre-training, a semi-supervised learning technique that leverages abundant unlabelled data alongside scarce labelled data. In this work, we evaluate pre-trained Transformer models on a selection of 13 smaller datasets from the SE literature, covering both source code and natural language. Our results suggest that pre-trained Transformers are competitive and in some cases superior to previous models, especially for tasks involving natural language, whereas for source code tasks, in particular for very small datasets, traditional machine learning methods often have the edge. In addition, we experiment with several techniques that ought to aid training on small datasets, including active learning, data augmentation, soft labels, self-training and intermediate-task fine-tuning, and issue recommendations on when they are effective. We also release all the data, scripts, and most importantly pre-trained models for the community to reuse on their own datasets.
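Of the techniques listed, self-training is easy to make concrete: fit on the scarce labels, pseudo-label the confident unlabelled samples, and refit on the union. A minimal scikit-learn sketch on synthetic data (the 0.9 confidence threshold is an arbitrary illustrative choice, not from the paper):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy setup: a small labelled set and an abundant unlabelled pool
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_lab, y_lab = X[:50], y[:50]
X_unlab = X[50:]

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# One self-training round: pseudo-label the confident unlabelled samples
confidence = clf.predict_proba(X_unlab).max(axis=1)
confident = confidence > 0.9
X_aug = np.vstack([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, clf.predict(X_unlab[confident])])
clf_st = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```

In practice, several such rounds are run, and the threshold trades pseudo-label quantity against quality.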
- Research Article
- 10.1158/1538-7445.am2024-lb396
- Apr 5, 2024
- Cancer Research
The capabilities of artificial intelligence (AI) and machine learning (ML) are pivotal for refining patient stratification and subtype discrimination in clinical trials. Conventional ML methods often rely on large data sets for meaningful discoveries. NetraAI is a novel ML approach designed and trained to work with smaller data sets. The challenge with smaller data sets is that they do not reflect the totality of the disease that they represent. NetraAI employs a novel approach termed “Sub-Insight Learning”, utilizing validated mathematical methods to analyze even small patient data sets. This allows the system to decompose the data sets into high and low confidence patient subpopulations, enhancing predictive model accuracy and reducing overfitting. Further, the system explains what variables are driving the etiology defining the subpopulations of patients. Using two non-small cell lung cancer (NSCLC) data sets (GSE18842 and GSE10245) consisting of only 104 samples from adenocarcinoma (ADC) and squamous cell carcinoma (SCC), NetraAI distinguished the two subtypes through unique genetic signatures. Notably, nine of the ten variables identified correlate with known NSCLC markers, with PIGX emerging as a novel target. Leveraging protein-protein interaction networks (PPI) revealed connections between PIGX and BACE1. BACE1 has been implicated as a driver of NSCLC brain metastasis. These findings shed light on the biology of membrane proteins and their post-translational modifications, a factor implicated in various diseases, prompting further exploration. NetraAI demonstrates a significant breakthrough in precision medicine for oncology, capable of generating meaningful insights from small data sets. The discovery of novel biomarkers and their implications in cancer and other diseases underline the potential of this AI-driven approach in advancing current research paradigms and patient-specific treatments. Citation Format: Bessi Qorri, Mike J. 
Tsay, Paul Leonchyk, Larry Alphs, Luca Pani, Joseph Geraci. The power of NetraAI: Precision medicine in oncology through sub-insight learning from small data sets [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 2 (Late-Breaking, Clinical Trial, and Invited Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(7_Suppl):Abstract nr LB396.
- Research Article
- 10.3934/era.2023243
- Jan 1, 2023
- Electronic Research Archive
Machine learning (ML) techniques are extensively applied to practical maritime transportation issues. Due to the difficulty and high cost of collecting large volumes of data in the maritime industry, in many maritime studies, ML models are trained with small training datasets. The relative predictive performances of these trained ML models are then compared with each other and with the conventional model using the same test set. The ML model that performs the best out of the ML models and better than the conventional model on the test set is regarded as the most effective in terms of this prediction task. However, in scenarios with small datasets, this common process may lead to an unfair comparison between the ML and the conventional model. Therefore, we propose a novel process to fairly compare multiple ML models and the conventional model. We first select the best ML model in terms of predictive performance for the validation set. Then, we combine the training and the validation sets to retrain the best ML model and compare it with the conventional model on the same test set. Based on historical port state control (PSC) inspection data, we examine both the common process and the novel process in terms of their ability to fairly compare ML models and the conventional model. The results show that the novel process is more effective at fairly comparing the ML models with the conventional model on different test sets. Therefore, the novel process enables a fair assessment of ML models' ability to predict key performance indicators in the context of limited data availability in the maritime industry, such as predicting the ship fuel consumption and port traffic volume, thereby enhancing their reliability for real-world applications.
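The proposed comparison process can be sketched directly in scikit-learn. The models and synthetic data below are placeholders for illustration; the key steps are model selection on the validation set only, then retraining the winner on train + validation before a single test-set comparison with the conventional model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(300) * 0.1

# 60/20/20 split into train / validation / test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=0)

ml_models = [RandomForestRegressor(random_state=0), KNeighborsRegressor()]

# Step 1: choose the best ML model using ONLY the validation set
best = min(ml_models,
           key=lambda m: mean_squared_error(y_val,
                                            m.fit(X_tr, y_tr).predict(X_val)))

# Step 2: retrain the winner on train + validation, then compare it with the
# conventional model on the untouched test set
X_trval = np.vstack([X_tr, X_val])
y_trval = np.concatenate([y_tr, y_val])
best.fit(X_trval, y_trval)
conventional = LinearRegression().fit(X_trval, y_trval)
mse_ml = mean_squared_error(y_te, best.predict(X_te))
mse_conv = mean_squared_error(y_te, conventional.predict(X_te))
```

Because the test set is never used for selection, the final comparison is fair to both the ML models and the conventional model.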
- Research Article
2
- 10.1063/5.0214754
- Oct 1, 2024
- AIP Advances
Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels’ distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO–LUMO-gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
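The label-normalization step is simple to make concrete. A minimal sketch with toy values (not from the datasets used in the study): each label set is shifted to mean zero and scaled to standard deviation one, so the pre-training and fine-tuning label distributions align.

```python
import numpy as np

def normalize_labels(y):
    # Shift to mean zero, scale to standard deviation one
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, mu, sigma

y_pre = np.array([2.0, 4.0, 6.0])         # hypothetical cheap pre-training labels
y_fine = np.array([1.9, 4.2, 6.3, 5.0])   # hypothetical expensive reference labels

z_pre, mu_p, sigma_p = normalize_labels(y_pre)
z_fine, mu_f, sigma_f = normalize_labels(y_fine)
# A model pre-trained on z_pre and fine-tuned on z_fine sees aligned targets;
# predictions are mapped back to physical units via pred * sigma_f + mu_f
```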
- Research Article
5
- 10.37349/emed.2023.00153
- Jul 26, 2023
- Exploration of Medicine
Aim: Many small datasets of significant value exist in the medical space that are being underutilized. Due to the heterogeneity of complex disorders found in oncology, systems capable of discovering patient subpopulations while elucidating etiologies are of great value as they can indicate leads for innovative drug discovery and development. Methods: Two small non-small cell lung cancer (NSCLC) datasets (GSE18842 and GSE10245) consisting of 58 samples of adenocarcinoma (ADC) and 45 samples of squamous cell carcinoma (SCC) were used in a machine intelligence framework to identify genetic biomarkers differentiating these two subtypes. Utilizing a set of standard machine learning (ML) methods, subpopulations of ADC and SCC were uncovered while simultaneously extracting which genes, in combination, were significantly involved in defining the subpopulations. A previously described interactive hypothesis-generating method designed to work with ML methods was employed to provide an alternative way of extracting the most important combination of variables to construct a new data set. Results: Several genes were uncovered that were previously implicated by other methods. This framework accurately discovered known subpopulations, such as genetic drivers associated with differing levels of aggressiveness within the SCC and ADC subtypes. Furthermore, phosphatidylinositol glycan anchor biosynthesis, class X (PIGX) was a novel gene implicated in this study that warrants further investigation due to its role in breast cancer proliferation. Conclusions: The ability to learn from small datasets was highlighted and revealed well-established properties of NSCLC. This showcases the utility of ML techniques to reveal potential genes of interest, even from small datasets, shedding light on novel driving factors behind subpopulations of patients.
- Discussion
3
- 10.1016/j.jclinepi.2021.07.019
- Aug 1, 2021
- Journal of Clinical Epidemiology
Prediction models: stepwise development and simultaneous validation is a step back
- Conference Article
- 10.2118/213288-ms
- Mar 7, 2023
The machine learning method, now widely used for predicting well performance from unconventional reservoirs in the industry, generally needs large data sets for model development and training. The large data sets, however, are not always available, especially for newly developed unconventional plays. The objective of this work is to develop an innovative machine learning method for predicting well performance in unconventional reservoirs with a relatively small data set. With a small training data set, the corresponding machine learning model can significantly suffer from so-called overfitting, meaning that the model can match the training data but has poor predictivity. To overcome this, our new method averages predictions from multiple models that are developed with the same model input but different initial guesses of the model parameters, which are unknowns in a machine learning algorithm determined during model training. The averaged results are used for the final model prediction. Unlike traditional ensemble learning methods, each prediction in the new method uses all the input data rather than a subset. We mathematically prove that the averaged prediction provides less model uncertainty and, under certain conditions, the optimum prediction. It is also demonstrated that the method practically minimizes overfitting and gives a relatively unique prediction. The usefulness of the method is further confirmed by its successful application to a data set collected from fewer than 100 wells in an unconventional reservoir. Sensitivity results with the trained machine learning model show that the model results are consistent with domain knowledge regarding production from the reservoir.
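The averaging scheme can be sketched with any trainable model. Below, a small scikit-learn neural network is a hypothetical stand-in for the paper's model: several copies are fit on the full (small) dataset, differing only in their random initial parameter guess, and their predictions are averaged.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 3))                 # a "small" well dataset
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.05, 60)

# Same inputs, different initial parameter guesses (random_state); every model
# is trained on the FULL dataset, unlike bagging-style ensembles
models = [MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                       random_state=seed).fit(X, y) for seed in range(5)]

X_new = rng.uniform(-1, 1, size=(10, 3))
y_avg = np.mean([m.predict(X_new) for m in models], axis=0)  # final prediction
```

Averaging over initializations damps the initialization-dependent component of each model's error, which is the mechanism behind the reduced uncertainty the paper proves.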
- Research Article
216
- 10.1038/s41598-018-27344-x
- Jun 13, 2018
- Scientific Reports
We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods - sum over bonds, custom descriptors, Coulomb matrices, Bag of Bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties out of sample. By including another dataset with ≈300 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.
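The winning combination, sum-over-bonds featurization with kernel ridge regression, is easy to sketch. The bond types, counts, and target values below are made up for illustration and are not from the paper's dataset:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# "Sum over bonds": featurize a molecule by counting each bond type
BOND_TYPES = ["C-H", "C-C", "C=O", "C-N", "N-O"]

def sum_over_bonds(bond_counts):
    x = np.zeros(len(BOND_TYPES))
    for bond, count in bond_counts.items():
        x[BOND_TYPES.index(bond)] = count
    return x

# Made-up toy molecules ({bond type: count}) and toy property targets
mols = [{"C-H": 4}, {"C-H": 6, "C-C": 1}, {"C-H": 8, "C-C": 2},
        {"C-H": 5, "C-C": 1, "C=O": 1}]
y = np.array([-74.9, -84.0, -104.7, -166.2])

X = np.array([sum_over_bonds(m) for m in mols])
model = KernelRidge(kernel="rbf", alpha=1e-3).fit(X, y)
pred = model.predict(X)
```

The appeal of bond counting for small datasets is its low dimensionality: the feature vector length is the number of bond types, not the number of atoms.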
- Preprint Article
1
- 10.26434/chemrxiv.5883157.v2
- Feb 16, 2018
- Research Article
6
- 10.3390/app12168252
- Aug 18, 2022
- Applied Sciences
Applying machine learning (ML) and fuzzy inference systems (FIS) generally requires large datasets to obtain accurate predictions. However, in the case of oil spills on ground environments, only small datasets are available. Therefore, this research aims to assess the suitability of ML techniques and FIS for predicting the consequences of oil spills on ground environments using small datasets. Consequently, we present a hybrid approach for assessing the suitability of ML (Linear Regression, Decision Trees, Support Vector Regression, Ensembles, and Gaussian Process Regression) and the adaptive neuro-fuzzy inference system (ANFIS) for predicting the consequences of oil spills with a small dataset. This paper proposes enlarging the initial small dataset of an oil spill on a ground environment with synthetic data generated by applying a mathematical model. ML techniques and ANFIS were tested with the same generated synthetic datasets to assess the proposed approach. The proposed ANFIS-based approach performs well and is sufficiently efficient for predicting the consequences of oil spills on ground environments with a smaller dataset than the applied ML techniques require. The main finding of this paper is that FIS is suitable for prediction with a small dataset and provides sufficiently accurate results.
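The dataset-enlargement idea can be sketched generically: generate synthetic samples from a mathematical model of the spill, append them to the small measured set, and train on the union. Here `spill_model` is a made-up placeholder for such a model, and a random-forest regressor stands in for the ML/ANFIS models compared in the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def spill_model(volume, slope):
    # Hypothetical physics-style model of affected area (illustrative only)
    return 12.0 * volume ** 0.8 * (1 + 0.3 * slope)

rng = np.random.default_rng(0)
# Tiny "measured" dataset: (volume, slope) inputs with noisy outcomes
X_small = rng.uniform([1, 0], [50, 1], size=(8, 2))
y_small = spill_model(X_small[:, 0], X_small[:, 1]) + rng.normal(0, 5, 8)

# Enlarge it with synthetic samples generated from the mathematical model
X_syn = rng.uniform([1, 0], [50, 1], size=(200, 2))
y_syn = spill_model(X_syn[:, 0], X_syn[:, 1])
X_aug = np.vstack([X_small, X_syn])
y_aug = np.concatenate([y_small, y_syn])

reg = RandomForestRegressor(random_state=0).fit(X_aug, y_aug)
```

The quality of the trained model is bounded by the fidelity of the generating model, which is why the paper validates ML and ANFIS on the same synthetic datasets.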
- Research Article
20
- 10.1016/s2665-9913(20)30217-4
- Aug 1, 2020
- The Lancet Rheumatology
Making a big impact with small datasets using machine-learning approaches.
- Research Article
19
- 10.3390/app10238481
- Nov 27, 2020
- Applied Sciences
In many machine learning applications, measurements are sometimes incomplete or noisy, resulting in missing features. In other cases, and for different reasons, the datasets are originally small, and therefore more data samples are required to derive useful supervised or unsupervised classification methods. Correct handling of incomplete, noisy or small datasets in machine learning is a fundamental and classic challenge. In this article, we provide a unified review of recently proposed methods based on signal decomposition for missing-feature imputation (data completion), classification of noisy samples and artificial generation of new data samples (data augmentation). We illustrate the application of these signal decomposition methods in diverse practical machine learning examples, including brain-computer interfaces, classification of epileptic intracranial electroencephalogram signals, face recognition/verification, and water networks data analysis. We show that a signal decomposition approach can provide valuable tools to improve machine learning performance with low-quality datasets.
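One concrete decomposition-based tool for missing-feature imputation is iterative low-rank SVD reconstruction (a generic sketch of the idea, not necessarily a method from the review): fill missing entries with column means, then repeatedly project onto a low-rank approximation and copy back the imputed entries.

```python
import numpy as np

def svd_impute(X, rank=2, n_iter=50):
    # Fill missing entries (NaN) using an iterative low-rank SVD reconstruction
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[mask] = low_rank[mask]   # refine only the missing entries
    return filled

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 6))  # rank-2 data
X = A.copy()
X[3, 1] = np.nan
X[7, 4] = np.nan
X_hat = svd_impute(X, rank=2)
```

When the underlying data really is low-rank, the observed entries pin down the factors and the imputed values converge toward the true ones.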
- Research Article
54
- 10.1007/s11042-020-09637-4
- Aug 19, 2020
- Multimedia Tools and Applications
Skin cancer is one of the most aggressive cancers in the world. A Computer-Aided Diagnosis (CAD) system for cancer detection and classification is a top-rated solution that decreases human effort and time while achieving very high classification accuracy. Machine learning (ML) and deep learning (DL) based approaches have been widely used to develop robust skin-lesion classification systems. Each of the techniques excels where the other fails, and their performance is closely related to the size of the learning dataset: ML-based approaches are less effective than DL-based ones when working with large datasets, and vice versa. In this article, we propose a powerful skin-lesion classification approach based on a fusion of handcrafted features (shape, skeleton, color, and texture) and features extracted from the most powerful DL architectures. This combination makes it possible to remedy the limitations of both the ML and DL approaches for both large and small datasets. Feature engineering is then applied to remove redundant features and to select only relevant ones. The proposed approach is validated and tested on both small and large datasets. A comparative study is also conducted to compare the proposed approach with different recent approaches applied to each dataset. The results obtained show that this feature-fusion based approach is very promising and can effectively combine the power of ML and DL based approaches.
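The fusion step itself reduces to feature concatenation before classification. A minimal sketch with random stand-ins for the handcrafted descriptors and deep features (the real pipeline would add feature selection to drop redundant dimensions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100
# Stand-ins: handcrafted descriptors (shape/color/texture) and CNN embeddings
f_hand = rng.standard_normal((n, 12))
f_deep = rng.standard_normal((n, 128))
y = (f_hand[:, 0] + f_deep[:, 0] > 0).astype(int)   # toy labels

# Fuse by concatenating the two feature blocks, then classify
X_fused = np.hstack([f_hand, f_deep])
clf = LogisticRegression(max_iter=1000).fit(X_fused, y)
acc = clf.score(X_fused, y)
```

Because the handcrafted block carries information the deep features may miss on small datasets (and vice versa on large ones), the concatenation lets a single classifier draw on both.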
- Book Chapter
- 10.1007/978-3-030-71704-9_70
- Jan 1, 2021
The application of machine learning (ML) algorithms aims to develop prognostic tools that can be trained on routinely collected data. In a typical scenario, the ML-based prognostic tool searches through large volumes of data to look for complex relationships in the training data. However, not much attention has been devoted to scenarios where small sample datasets are a widespread occurrence, as in research areas involving human participants such as clinical trials, genetics, and neuroimaging. In this research, we have studied the impact of the size of the sample dataset on the performance of different ML algorithms. We compare model fitting and prediction performance on the original small dataset and on an augmented dataset. Our research found that a model fitted on a small dataset exhibits severe overfitting at the testing stage, which is reduced when the model is trained on the augmented dataset. However, the improvement in model performance from training on the augmented dataset varies across ML algorithms.
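The augmentation experiment can be sketched on a toy regression task. Input jittering below is one simple augmentation scheme, not necessarily the one used in the chapter; the test errors of the small-data fit and the augmented fit can then be compared:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# A very small training set prone to overfitting
X_small = rng.uniform(-1, 1, (15, 1))
y_small = np.sin(3 * X_small[:, 0]) + rng.normal(0, 0.1, 15)

# Augment by jittering inputs with small Gaussian noise (one simple scheme)
X_aug = np.vstack([X_small + rng.normal(0, 0.05, X_small.shape)
                   for _ in range(20)])
y_aug = np.tile(y_small, 20)

tree_small = DecisionTreeRegressor(random_state=0).fit(X_small, y_small)
tree_aug = DecisionTreeRegressor(random_state=0).fit(X_aug, y_aug)

# Evaluate both fits on held-out points from the true function
X_test = rng.uniform(-1, 1, (200, 1))
y_test = np.sin(3 * X_test[:, 0])
err_small = np.mean((tree_small.predict(X_test) - y_test) ** 2)
err_aug = np.mean((tree_aug.predict(X_test) - y_test) ** 2)
```

Consistent with the chapter's finding, the size of the gap between the two errors depends on the learner, so the benefit of augmentation should be checked per algorithm.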