Insight into machine learning models to predict toxicity of organophosphorus insecticides to Photobacterium phosphoreum based on a small dataset

Similar Papers
  • Research Article
  • Cited by 6
  • 10.1117/1.jbo.28.3.036501
Small training dataset convolutional neural networks for application-specific super-resolution microscopy.
  • Mar 14, 2023
  • Journal of biomedical optics
  • Varun Mannam + 1 more

Machine learning (ML) models based on deep convolutional neural networks have been used to significantly increase microscopy resolution, speed [signal-to-noise ratio (SNR)], and data interpretation. The bottleneck in developing effective ML systems is often the need to acquire large datasets to train the neural network. We demonstrate how adding a "dense encoder-decoder" (DenseED) block can be used to effectively train a neural network that produces super-resolution (SR) images from conventional microscopy diffraction-limited (DL) images, trained using a small dataset [15 fields of view (FOVs)]. ML ordinarily helps to retrieve SR information from a DL image only when trained with a massive training dataset. The aim of this work is to demonstrate a neural network that estimates SR images from DL images using modifications that enable training with a small dataset. We employ "DenseED" blocks in existing SR ML network architectures. DenseED blocks use a dense layer that concatenates features from the previous convolutional layer to the next convolutional layer. DenseED blocks in fully convolutional networks (FCNs) estimate SR images when trained with a small training dataset (15 FOVs) of human cells from the Widefield2SIM dataset and in fluorescently labeled fixed bovine pulmonary artery endothelial cell samples. Conventional ML models without DenseED blocks trained on small datasets fail to accurately estimate SR images, while models including DenseED blocks succeed. Networks containing DenseED blocks also improved the average peak SNR (PSNR) and resolution. We evaluated various configurations of target image generation methods (e.g., experimentally captured targets and computationally generated targets) used to train FCNs with and without DenseED blocks, and showed that simple FCNs with DenseED blocks outperform simple FCNs without them.
DenseED blocks in neural networks show accurate extraction of SR images even if the ML model is trained with a small training dataset of 15 FOVs. This approach shows that microscopy applications can use DenseED blocks to train on smaller, application-specific imaging datasets, and there is promise for applying this to other imaging modalities, such as MRI and x-ray.
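
The dense connectivity that DenseED blocks rely on, each layer receiving the concatenation of all earlier feature maps, can be sketched in plain Python. This is a toy illustration with flat feature lists, not the paper's convolutional implementation; the layer functions are hypothetical stand-ins.

```python
# Toy sketch of dense connectivity: each "layer" receives the concatenation
# of the block input and every earlier layer's output. Feature maps are
# flat lists of numbers here; a real DenseED block uses convolutional
# layers on image tensors.

def dense_block(x, layers):
    features = list(x)
    for layer in layers:
        new_features = layer(features)       # layer sees all features so far
        features = features + new_features   # concatenate for the next layer
    return features

# Hypothetical stand-ins for convolutional layers:
layer1 = lambda feats: [sum(feats)]
layer2 = lambda feats: [max(feats), min(feats)]

out = dense_block([1.0, 2.0], [layer1, layer2])
# layer1 sees [1.0, 2.0]; layer2 sees the concatenation [1.0, 2.0, 3.0]
```

The concatenation pattern lets low-level features and gradients reach deeper layers directly, which is the mechanism the abstract credits for effective training on only 15 FOVs.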

  • Research Article
  • Cited by 15
  • 10.1109/tse.2021.3135465
Making the most of small Software Engineering datasets with modern machine learning
  • Jan 1, 2022
  • IEEE Transactions on Software Engineering
  • Julian Aron Prenner + 1 more

This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software Engineering there exist many small (<5,000 samples) and medium-sized (<100,000 samples) datasets. While deep learning has set the state of the art in many machine learning tasks, it is only recently that it has proven effective on small-sized datasets, primarily thanks to pre-training, a semi-supervised learning technique that leverages abundant unlabelled data alongside scarce labelled data. In this work, we evaluate pre-trained Transformer models on a selection of 13 smaller datasets from the SE literature, covering both source code and natural language. Our results suggest that pre-trained Transformers are competitive and in some cases superior to previous models, especially for tasks involving natural language, whereas for source code tasks, in particular for very small datasets, traditional machine learning methods often have the edge. In addition, we experiment with several techniques that ought to aid training on small datasets, including active learning, data augmentation, soft labels, self-training and intermediate-task fine-tuning, and issue recommendations on when they are effective. We also release all the data, scripts, and most importantly pre-trained models for the community to reuse on their own datasets.

  • Research Article
  • 10.1302/1358-992x.2024.18.057
FRACTURE DETECTION IN WRIST TRAUMA RADIOGRAPH: OPTIMIZING ALGORITHM PERFORMANCE USING TRANSFER LEARNING
  • Nov 14, 2024
  • Orthopaedic Proceedings
  • F Birkholtz + 3 more

Introduction: With advances in artificial intelligence, the use of computer-aided detection and diagnosis in clinical imaging is gaining traction. Typically, very large datasets are required to train machine-learning models, potentially limiting use of this technology when only small datasets are available. This study investigated whether pretraining of fracture detection models on large, existing datasets could improve the performance of the model when locating and classifying wrist fractures in a small X-ray image dataset. This concept is termed "transfer learning".

Method: Firstly, three detection models, namely the faster region-based convolutional neural network (faster R-CNN), you only look once version eight (YOLOv8), and RetinaNet, were pretrained using the large, freely available common objects in context (COCO) dataset (330,000 images). Secondly, these models were pretrained using an open-source wrist X-ray dataset called "Graz Paediatric Wrist Digital X-rays" (GRAZPEDWRI-DX) on (1) a fracture detection dataset (20,327 images) and (2) a fracture location and classification dataset (14,390 images). An orthopaedic surgeon classified the small available dataset of 776 distal radius X-rays (Arbeitsgemeinschaft für Osteosynthesefragen Foundation / Orthopaedic Trauma Association; AO/OTA), on which the models were tested.

Result: Detection models without pre-training on the large datasets were the least precise when tested on the small distal radius dataset. The model with the best accuracy to detect and classify wrist fractures was the YOLOv8 model pretrained on the GRAZPEDWRI-DX fracture detection dataset (mean average precision at an intersection over union of 50% = 59.7%). This model showed up to 33.6% improved detection precision compared to the same models with no pre-training.

Conclusion: Optimisation of machine-learning models can be challenging when only relatively small datasets are available. 
The findings of this study support the potential of transfer learning from large datasets to improve model performance in smaller datasets. This is encouraging for wider application of machine-learning technology in medical imaging evaluation, including less common orthopaedic pathologies.

  • Research Article
  • 10.3934/era.2023243
A fair evaluation of the potential of machine learning in maritime transportation
  • Jan 1, 2023
  • Electronic Research Archive
  • Xi Luo + 3 more

Machine learning (ML) techniques are extensively applied to practical maritime transportation issues. Due to the difficulty and high cost of collecting large volumes of data in the maritime industry, in many maritime studies, ML models are trained with small training datasets. The relative predictive performances of these trained ML models are then compared with each other and with the conventional model using the same test set. The ML model that performs the best out of the ML models and better than the conventional model on the test set is regarded as the most effective in terms of this prediction task. However, in scenarios with small datasets, this common process may lead to an unfair comparison between the ML and the conventional model. Therefore, we propose a novel process to fairly compare multiple ML models and the conventional model. We first select the best ML model in terms of predictive performance for the validation set. Then, we combine the training and the validation sets to retrain the best ML model and compare it with the conventional model on the same test set. Based on historical port state control (PSC) inspection data, we examine both the common process and the novel process in terms of their ability to fairly compare ML models and the conventional model. The results show that the novel process is more effective at fairly comparing the ML models with the conventional model on different test sets. Therefore, the novel process enables a fair assessment of ML models' ability to predict key performance indicators in the context of limited data availability in the maritime industry, such as predicting the ship fuel consumption and port traffic volume, thereby enhancing their reliability for real-world applications.
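
The proposed comparison process can be sketched end to end. The toy models and data below are illustrative (a mean predictor standing in for the "conventional" model and a least-squares line as an ML candidate), not the paper's PSC models:

```python
# Sketch of the fair-comparison process described above:
# 1) choose the best ML candidate on the validation set only,
# 2) retrain the winner on train + validation combined,
# 3) compare it with the conventional model once, on the held-out test set.

def mae(pred, xs, ys):
    return sum(abs(pred(x) - y) for x, y in zip(xs, ys)) / len(ys)

def fit_mean(xs, ys):                 # toy "conventional" model
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):               # toy ML candidate: least-squares line
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = cov / var
    a = my - b * mx
    return lambda x: a + b * x

def fair_compare(candidates, fit_conventional, train, val, test):
    (xs_tr, ys_tr), (xs_va, ys_va), (xs_te, ys_te) = train, val, test
    fitted = {name: fit(xs_tr, ys_tr) for name, fit in candidates.items()}
    best_name = min(fitted, key=lambda n: mae(fitted[n], xs_va, ys_va))
    xs_all, ys_all = xs_tr + xs_va, ys_tr + ys_va
    best = candidates[best_name](xs_all, ys_all)   # retrain on train + val
    conv = fit_conventional(xs_all, ys_all)
    return best_name, mae(best, xs_te, ys_te), mae(conv, xs_te, ys_te)

train = ([0, 1, 2, 3], [0.0, 2.0, 4.0, 6.0])   # toy data: y = 2x
val   = ([4, 5], [8.0, 10.0])
test  = ([6, 7], [12.0, 14.0])
name, ml_err, conv_err = fair_compare(
    {"mean": fit_mean, "linear": fit_linear}, fit_mean, train, val, test)
```

The key point of the process is that the test set is touched exactly once, after model selection and retraining, so the ML winner gains no unfair advantage from repeated test-set evaluation.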

  • Research Article
  • 10.1158/1538-7445.am2024-lb396
Abstract LB396: The power of NetraAI: Precision medicine in oncology through sub-insight learning from small data sets
  • Apr 5, 2024
  • Cancer Research
  • Bessi Qorri + 5 more

The capabilities of artificial intelligence (AI) and machine learning (ML) are pivotal for refining patient stratification and subtype discrimination in clinical trials. Conventional ML methods often rely on large data sets for meaningful discoveries. NetraAI is a novel ML approach designed and trained to work with smaller data sets. The challenge with smaller data sets is that they do not reflect the totality of the disease that they represent. NetraAI employs a novel approach termed “Sub-Insight Learning”, utilizing validated mathematical methods to analyze even small patient data sets. This allows the system to decompose the data sets into high and low confidence patient subpopulations, enhancing predictive model accuracy and reducing overfitting. Further, the system explains what variables are driving the etiology defining the subpopulations of patients. Using two non-small cell lung cancer (NSCLC) data sets (GSE18842 and GSE10245) consisting of only 104 samples from adenocarcinoma (ADC) and squamous cell carcinoma (SCC), NetraAI distinguished the two subtypes through unique genetic signatures. Notably, nine of the ten variables identified correlate with known NSCLC markers, with PIGX emerging as a novel target. Leveraging protein-protein interaction networks (PPI) revealed connections between PIGX and BACE1. BACE1 has been implicated as a driver of NSCLC brain metastasis. These findings shed light on the biology of membrane proteins and their post-translational modifications, a factor implicated in various diseases, prompting further exploration. NetraAI demonstrates a significant breakthrough in precision medicine for oncology, capable of generating meaningful insights from small data sets. The discovery of novel biomarkers and their implications in cancer and other diseases underline the potential of this AI-driven approach in advancing current research paradigms and patient-specific treatments. Citation Format: Bessi Qorri, Mike J. 
Tsay, Paul Leonchyk, Larry Alphs, Luca Pani, Joseph Geraci. The power of NetraAI: Precision medicine in oncology through sub-insight learning from small data sets [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 2 (Late-Breaking, Clinical Trial, and Invited Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(7_Suppl):Abstract nr LB396.

  • Research Article
  • Cited by 20
  • 10.1016/s2665-9913(20)30217-4
Making a big impact with small datasets using machine-learning approaches.
  • Aug 1, 2020
  • The Lancet Rheumatology
  • May Y Choi + 1 more


  • Research Article
  • Cited by 2
  • 10.1063/5.0214754
Transfer learning for molecular property predictions from small datasets
  • Oct 1, 2024
  • AIP Advances
  • Thorren Kirschbaum + 1 more

Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels’ distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO–LUMO-gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
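
The label-alignment step described above, normalizing both the pre-training and fine-tuning label sets to zero mean and unit standard deviation, can be sketched directly (illustrative values only; the paper's pipeline also involves the PaiNN model itself):

```python
# Z-score both label sets so their distributions match before transfer
# learning: cheap pre-training labels and expensive fine-tuning labels
# then live on the same scale.

def standardize(labels):
    n = len(labels)
    mean = sum(labels) / n
    std = (sum((y - mean) ** 2 for y in labels) / n) ** 0.5
    return [(y - mean) / std for y in labels], mean, std

pretrain_labels = [1.0, 2.0, 3.0, 4.0]      # e.g. cheap semi-empirical values
finetune_labels = [10.0, 20.0, 30.0, 40.0]  # e.g. target ab initio values

z_pre, _, _ = standardize(pretrain_labels)
z_fit, mean, std = standardize(finetune_labels)
# Both z-scored sets now have mean 0 and std 1; predictions made in
# z-space are mapped back with y = z * std + mean.
```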

  • Conference Article
  • 10.2118/213288-ms
An Innovative Machine Learning Method for Predicting Well Performance in Unconventional Reservoirs with a Relatively Small Data Set
  • Mar 7, 2023
  • Hui-Hai Liu + 3 more

The machine learning method, now widely used for predicting well performance from unconventional reservoirs in the industry, generally needs large data sets for model development and training. The large data sets, however, are not always available, especially for newly developed unconventional plays. The objective of this work is to develop an innovative machine learning method for predicting well performance in unconventional reservoirs with a relatively small data set. For a small training data set, the corresponding machine learning model can significantly suffer from so-called overfitting, meaning that the model can match the training data but has poor predictivity. To overcome this, our new method averages predictions from multiple models that are developed with the same model input but different initial guesses of the model parameters, which are unknowns in a machine learning algorithm and determined in the model training. The averaged results are used for the final model prediction. Unlike traditional ensemble learning methods, each prediction in the new method uses all the input data rather than a subset. We mathematically prove that the averaged prediction provides less model uncertainty and, under certain conditions, the optimum prediction. It is also demonstrated that the method practically minimizes overfitting and gives a relatively unique prediction. The usefulness of the method is further confirmed by its successful application to a data set collected from fewer than 100 wells in an unconventional reservoir. Sensitivity results with the trained machine learning model show that the model results are consistent with domain knowledge regarding production from the reservoir.
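
The averaging idea can be sketched with a toy one-parameter model: fit the same model on the same full training set several times, varying only the random initial parameter guess, then average the resulting predictions. (The paper applies this to a full machine-learning algorithm; the model and data below are illustrative.)

```python
import random

# Fit a slope w for y = w * x by gradient descent on squared error,
# starting from a given initial guess. Each run uses ALL the training
# data; only the initial guess differs between runs.
def fit_slope(xs, ys, w0, steps=200, lr=0.01):
    w = w0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data: y = 2x
rng = random.Random(0)

# Ten models, identical data, different random initial guesses.
slopes = [fit_slope(xs, ys, rng.uniform(-5, 5)) for _ in range(10)]

# The final prediction is the average across the ten fitted models.
avg_pred = lambda x: sum(w * x for w in slopes) / len(slopes)
```

Unlike bagging, no run sees a subset of the data; the ensemble varies only in the optimizer's starting point, which is the distinction the abstract draws against traditional ensemble learning.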

  • Research Article
  • Cited by 216
  • 10.1038/s41598-018-27344-x
Applying machine learning techniques to predict the properties of energetic materials
  • Jun 13, 2018
  • Scientific Reports
  • Daniel C Elton + 4 more

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods - sum over bonds, custom descriptors, Coulomb matrices, Bag of Bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties out of sample. By including another dataset with ≈300 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.
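
Kernel ridge regression, the best-performing model above, can be sketched in a few lines. The features below are plain numbers standing in for "sum over bonds" vectors, and the kernel width and regularization values are illustrative:

```python
import math

# Minimal kernel ridge regression: solve (K + lam*I) alpha = y, then
# predict f(x) = sum_i alpha_i * k(x, x_i).

def rbf(a, b, gamma=0.5):
    return math.exp(-gamma * (a - b) ** 2)

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def krr_fit(xs, ys, lam=1e-6):
    K = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    return lambda x: sum(a * rbf(x, xi) for a, xi in zip(alpha, xs))

f = krr_fit([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
# With tiny regularization the model nearly interpolates the training points.
```

In practice the regularization strength `lam` and kernel width `gamma` are tuned by cross-validation, which matters especially on a 109-molecule dataset.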

  • Preprint Article
  • Cited by 1
  • 10.26434/chemrxiv.5883157.v2
Applying Machine Learning Techniques to Predict the Properties of Energetic Materials
  • Feb 16, 2018
  • Daniel Elton + 4 more

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods - sum over bonds, custom descriptors, Coulomb matrices, bag of bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties out of sample. By including another dataset with 309 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.

  • Research Article
  • Cited by 39
  • 10.3390/cancers13061415
Fine-Tuning Approach for Segmentation of Gliomas in Brain Magnetic Resonance Images with a Machine Learning Method to Normalize Image Differences among Facilities
  • Mar 19, 2021
  • Cancers
  • Satoshi Takahashi + 27 more

Simple Summary: This study evaluates the performance degradation of machine learning models for segmenting gliomas in brain magnetic resonance images caused by domain shift and proposes possible solutions. Although machine learning models exhibit significant potential for clinical applications, performance degradation in different cohorts is a problem that must be solved. In this study, we find that the performance degradation of machine learning models is significant enough to render clinical applications difficult. We demonstrate that performance can be improved by fine-tuning with a small number of cases from each facility, even though the available data appeared to be biased. Our method creates a facility-specific machine learning model from a small real-world dataset and a public dataset; therefore, our fine-tuning method could be a practical solution in situations where only a small dataset is available. Machine learning models for automated magnetic resonance image segmentation may be useful in aiding glioma detection. However, image differences among facilities cause performance degradation and impede detection. This study proposes a method to solve this issue. We used data from the Multimodal Brain Tumor Image Segmentation Benchmark (BraTS) and the Japanese cohort (JC) datasets. Three models for tumor segmentation were developed. In our methodology, the BraTS and JC models are trained on the BraTS and JC datasets, respectively, whereas the fine-tuning models are developed from the BraTS model and fine-tuned using the JC dataset. Our results show that the Dice coefficient score of the JC model for the test portion of the JC dataset was 0.779 ± 0.137, whereas that of the BraTS model was lower (0.717 ± 0.207). The mean Dice coefficient score of the fine-tuning model was 0.769 ± 0.138. 
There was a significant difference between the BraTS and JC models (p < 0.0001) and between the BraTS and fine-tuning models (p = 0.002); however, there was no significant difference between the JC and fine-tuning models (p = 0.673). As our fine-tuning method requires fewer than 20 cases, it is useful even in a facility where the number of glioma cases is small.
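
The fine-tuning idea, start from parameters learned on a large source dataset and continue training briefly on the small target dataset, can be sketched with a toy one-parameter model (illustrative only, not the paper's segmentation network):

```python
# Toy fine-tuning: gradient descent on squared error for y = w * x.
def train(xs, ys, w0, steps, lr=0.01):
    w = w0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# Large "source" dataset follows y = 1.8 x; small "target" dataset, y = 2 x.
w_src = train([1.0, 2.0, 3.0, 4.0], [1.8, 3.6, 5.4, 7.2], w0=0.0, steps=500)
w_fine = train([1.0, 2.0], [2.0, 4.0], w0=w_src, steps=50)       # fine-tune
w_scratch = train([1.0, 2.0], [2.0, 4.0], w0=0.0, steps=50)      # from scratch
# With the same small budget on the target data, the fine-tuned model
# ends up closer to the target slope than the model trained from scratch.
```

The same principle is what lets the BraTS-pretrained model adapt to the JC cohort with fewer than 20 cases: the pretrained weights already sit close to a good solution.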

  • Research Article
  • Cited by 29
  • 10.3390/tomography7020014
A Radiogenomics Ensemble to Predict EGFR and KRAS Mutations in NSCLC
  • Apr 29, 2021
  • Tomography
  • Silvia Moreno + 6 more

Lung cancer causes more deaths globally than any other type of cancer. To determine the best treatment, detecting EGFR and KRAS mutations is of interest. However, non-invasive ways to obtain this information are not available. Furthermore, relevant public datasets are often not large enough, so the performance of single classifiers is limited. In this paper, an ensemble approach is applied to increase the performance of EGFR and KRAS mutation prediction using a small dataset. A new voting scheme, Selective Class Average Voting (SCAV), is proposed and its performance is assessed both for machine learning models and CNNs. For the EGFR mutation, in the machine learning approach, there was an increase in the sensitivity from 0.66 to 0.75, and an increase in AUC from 0.68 to 0.70. With the deep learning approach, an AUC of 0.846 was obtained, and with SCAV, the accuracy of the model was increased from 0.80 to 0.857. For the KRAS mutation, both in the machine learning models (0.65 to 0.71 AUC) and the deep learning models (0.739 to 0.778 AUC), a significant increase in performance was found. The results obtained in this work show how to effectively learn from small image datasets to predict EGFR and KRAS mutations, and that using ensembles with SCAV increases the performance of machine learning classifiers and CNNs. The results provide confidence that as large datasets become available, tools to augment clinical capabilities can be fielded.
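
As a baseline for schemes like SCAV, plain class-average ("soft") voting can be sketched as follows. SCAV itself adds a selective, per-class element whose details are not reproduced here; the classifier outputs below are hypothetical:

```python
# Plain class-average voting: average each class's probability across the
# ensemble members, then predict the class with the highest average.

def average_vote(prob_lists):
    n = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Three hypothetical classifiers scoring two classes (mutant, wild-type):
probs = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]
label, avg = average_vote(probs)
# Averaging smooths out the one dissenting classifier's vote.
```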

  • Research Article
  • Cited by 4
  • 10.1021/jacs.4c06595
Machine Learning in Complex Organic Mixtures: Applying Domain Knowledge Allows for Meaningful Performance with Small Data Sets.
  • Jul 31, 2024
  • Journal of the American Chemical Society
  • Katelyn Le + 4 more

The ability to quantify individual components of complex mixtures is a challenge found throughout the life and physical sciences. An improved capacity to generate large data sets along with the uptake of machine-learning (ML)-based analysis tools has allowed for various "omics" disciplines to realize exceptional advances. Other areas of chemistry that deal with complex mixtures often do not leverage these advances. Environmental samples, for example, can be more difficult to access, and the resulting small data sets are less appropriate for unconstrained ML approaches. Herein, we present an approach to address this latter issue. Using a very small environmental data set (35 high-resolution mass spectra gathered from various solvent extractions of Canadian petroleum fractions), we show that the application of specific domain knowledge can lead to ML models with notable performance.

  • Research Article
  • Cited by 31
  • 10.1016/j.jbi.2020.103424
Building machine learning models without sharing patient data: A simulation-based analysis of distributed learning by ensembling.
  • Apr 23, 2020
  • Journal of Biomedical Informatics
  • Anup Tuladhar + 3 more


  • Research Article
  • Cited by 173
  • 10.1016/j.eswa.2020.113696
Improving classification accuracy using data augmentation on small data sets
  • Jul 15, 2020
  • Expert Systems with Applications
  • Francisco J Moreno-Barea + 2 more

