Critical assessment of machine learning approaches for classification, dynamic prediction and surrogate modeling in food fermentation.

Abstract

Machine learning (ML) is increasingly used in food science because it can extract insights from large datasets. However, the advantages of ML over traditional mechanistic, knowledge-based models remain unclear, especially under the limited-data conditions often encountered in food bioprocesses. This study addresses this gap by critically evaluating supervised ML techniques (specifically decision trees, support vector machines, and neural networks) against a knowledge-based (KB) model, using wine fermentation as a practical, experimental example. We evaluated these approaches in three tasks. Tasks 1 and 2 use time-series fermentation data to (1) classify industrial yeast strains based on their metabolite profiles and (2) predict fermentation dynamics. Task 3 focuses on creating a fast surrogate model by applying ML techniques to synthetic data generated by a mechanistic model. For yeast strain classification, the highest test accuracy, 74%, was achieved when all available metabolite data were used. In predicting fermentation dynamics, the KB model outperformed the ML models, achieving an average normalized root mean squared error (NRMSE) of approximately 6%. The ML models, when additional data were incorporated, had a prediction error of around 7.6%. Lastly, a deep learning surrogate model trained solely on synthetic, mechanistic data showed very low errors (around 0.6%) relative to the KB model on test sets, while also reducing simulation time by a factor of 30. Our findings highlight the significance of experimental design: although ML models perform well when trained on large and diverse datasets, they often struggle with limited data or when predicting outcomes beyond the conditions observed during training. In contrast, mechanistic models show better generalization and biological interpretability.
The complementary nature of both approaches suggests that combining them can lead to more robust, data-informed design and control in complex fermentation systems. Leveraging these complementary strengths, we developed and validated a hybrid model that integrates knowledge-based predictions with a residual neural network to correct systematic errors, reducing overall NRMSE from 6% to 5% and improving prediction for most key compounds.
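
The hybrid idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how the RMSE is normalized (the observed range is used here), a linear least-squares fit stands in for the residual neural network, and the toy data and the names `nrmse`, `fit_residual_corrector`, and `hybrid_predict` are hypothetical.

```python
import numpy as np

def nrmse(observed, predicted):
    """RMSE divided by the observed range (the normalization is an assumption)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    return rmse / (observed.max() - observed.min())

def fit_residual_corrector(features, observed, kb_prediction):
    """Fit the KB model's residuals with least squares.

    A linear model is used here as a simple stand-in for the
    residual neural network mentioned in the abstract.
    """
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, observed - kb_prediction, rcond=None)
    return coef

def hybrid_predict(features, kb_prediction, coef):
    """Hybrid prediction = KB prediction + learned residual correction."""
    design = np.column_stack([np.ones(len(features)), features])
    return kb_prediction + design @ coef

# Toy data (hypothetical): a decaying concentration and a KB model
# whose prediction drifts away from it systematically over time.
t = np.linspace(0.0, 10.0, 50)
observed = 100.0 * np.exp(-0.3 * t)
kb_prediction = observed + 2.0 + 0.5 * t

coef = fit_residual_corrector(t, observed, kb_prediction)
hybrid = hybrid_predict(t, kb_prediction, coef)

print(f"KB NRMSE:     {nrmse(observed, kb_prediction):.4f}")
print(f"Hybrid NRMSE: {nrmse(observed, hybrid):.4f}")
```

In this toy case the systematic error is exactly linear in time, so the correction removes it almost entirely; with real fermentation data the residual would be noisier and the gain correspondingly smaller.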

Similar Papers
  • Front Matter
  • Citations: 2
  • 10.3389/fsysb.2024.1367549
Editorial: Combining mechanistic modeling with machine learning to study multiscale biological processes.
  • Feb 2, 2024
  • Frontiers in systems biology
  • Shayn Peirce-Cottler + 1 more

Biological and physiological processes occur across a broad spatiotemporal range, with processes at one level of scale (e.g., gene expression inside single cells) affecting processes at other levels of scale (e.g., coordinated migration of endothelial cells during angiogenesis and tumor growth). Deducing the cause-and-effect relationships that link biological and physiological mechanisms across scales is a major challenge that both machine learning (ML) and mechanistic modeling approaches seek to address. Mechanistic models are particularly well-suited for simulating and/or computing how abstracted, intersecting biological processes give rise to changes over time. Data-driven ML approaches, such as neural networks and clustering algorithms, on the other hand, integrate massive amounts of data to identify patterns, trends, and correlations in the data. Both methodologies can be used to generate novel insights and testable hypotheses, though the means for doing so differ depending on the modeling approach. Emerging computational strategies are combining mechanistic modeling and ML in ways that capitalize on their unique attributes and compensate for each other's deficiencies. As discussed in this Research Topic, the resulting synergy created by merging these methods more comprehensively and efficiently leverages large-scale data sets to produce new insights about what biological processes connect across spatial and temporal scales and how they intersect to drive changes in cells, tissues, and organs. Sivakumar et al. provide foundational context for the integration of mechanistic and ML models, focusing on a particular class of the former (namely, agent-based models [ABM]).
The authors introduce and explain key concepts, strengths, and limitations of both classes of models, and particularly highlight applications to spatial modeling of biological processes. They note the difficulties inherent in assessing ML models and discuss multiple applications of ML in the context of ABM (e.g., defining and determining agent rules, parameter estimation/model calibration, and reducing the computational cost of ABM). Erdem and Birtwistle present another use case for integrating ML and mechanistic modeling, this time in the context of mining 'omics data to define causal interactions and then integrate these inferences into a mechanistic model. The MEMMAL (MEchanistic

  • Research Article
  • Citations: 67
  • 10.1371/journal.pcbi.1010988
Bridging the gap between mechanistic biological models and machine learning surrogates.
  • Apr 20, 2023
  • PLOS Computational Biology
  • Ioana M Gherman + 5 more

Mechanistic models have been used for centuries to describe complex interconnected processes, including biological ones. As the scope of these models has widened, so have their computational demands. This complexity can limit their suitability when running many simulations or when real-time results are required. Surrogate machine learning (ML) models can be used to approximate the behaviour of complex mechanistic models, and once built, their computational demands are several orders of magnitude lower. This paper provides an overview of the relevant literature, both from an applicability and a theoretical perspective. For the latter, the paper focuses on the design and training of the underlying ML models. Application-wise, we show how ML surrogates have been used to approximate different mechanistic models. We present a perspective on how these approaches can be applied to models representing biological processes with potential industrial applications (e.g., metabolism and whole-cell modelling) and show why surrogate ML models may hold the key to making the simulation of complex biological systems possible using a typical desktop computer.
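
The surrogate workflow described above (generate synthetic data from a mechanistic model, then train a cheap approximator on it) can be sketched as follows. This is a minimal illustration, not any paper's method: the logistic-growth model, its parameter values, and the function name `mechanistic_final_biomass` are hypothetical, and a degree-6 polynomial stands in for the neural-network surrogates discussed in the literature.

```python
import numpy as np

def mechanistic_final_biomass(rate, x0=0.1, capacity=1.0, t_end=5.0, dt=0.01):
    """Integrate logistic growth dx/dt = rate*x*(1 - x/capacity) with explicit Euler.

    Returns the state at t_end. This plays the role of the "expensive"
    mechanistic simulation in the sketch.
    """
    x = x0
    for _ in range(int(round(t_end / dt))):
        x += dt * rate * x * (1.0 - x / capacity)
    return x

# 1) Generate synthetic training data from the mechanistic model.
train_rates = np.linspace(0.2, 1.0, 50)
train_outputs = np.array([mechanistic_final_biomass(r) for r in train_rates])

# 2) Fit a cheap surrogate mapping growth rate -> final biomass.
#    (Polynomial regression is a stand-in for a neural network.)
surrogate = np.poly1d(np.polyfit(train_rates, train_outputs, deg=6))

# 3) Compare surrogate and mechanistic model on rates that were not in
#    the training grid but lie inside the training range.
test_rates = np.linspace(0.25, 0.95, 15)
reference = np.array([mechanistic_final_biomass(r) for r in test_rates])
max_abs_error = np.max(np.abs(surrogate(test_rates) - reference))

print(f"max absolute surrogate error: {max_abs_error:.2e}")
```

Once fitted, evaluating the surrogate is a single polynomial evaluation instead of a time-stepping loop, which is where the speed-up comes from. The check in step 3 only covers inputs inside the training range; the surrogate should not be trusted outside it.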

  • Research Article
  • Citations: 22
  • 10.1021/acs.molpharmaceut.3c00502
The Comparison of Machine Learning and Mechanistic In Vitro-In Vivo Extrapolation Models for the Prediction of Human Intrinsic Clearance.
  • Oct 9, 2023
  • Molecular Pharmaceutics
  • Christopher E Keefer + 13 more

Accurate prediction of human pharmacokinetics (PK) remains one of the key objectives of drug metabolism and PK (DMPK) scientists in drug discovery projects. This is typically performed by using in vitro-in vivo extrapolation (IVIVE) based on mechanistic PK models. In recent years, machine learning (ML), with its ability to harness patterns from previous outcomes to predict future events, has gained increased popularity in application to absorption, distribution, metabolism, and excretion (ADME) sciences. This study compares the performance of various ML and mechanistic models for the prediction of human IV clearance for a large (645) set of diverse compounds with literature human IV PK data, as well as measured relevant in vitro end points. ML models were built using multiple approaches for the descriptors: (1) calculated physical properties and structural descriptors based on chemical structure alone (classical QSAR/QSPR); (2) in vitro measured inputs only with no structure-based descriptors (ML IVIVE); and (3) in silico ML IVIVE using in silico model predictions for the in vitro inputs. For the mechanistic models, well-stirred and parallel-tube liver models were considered with and without the use of empirical scaling factors and with and without renal clearance. The best ML model for the prediction of in vivo human intrinsic clearance (CLint) was an in vitro ML IVIVE model using only six in vitro inputs with an average absolute fold error (AAFE) of 2.5. The best mechanistic model used the parallel-tube liver model, with empirical scaling factors resulting in an AAFE of 2.8. The corresponding mechanistic model with full in silico inputs achieved an AAFE of 3.3. These relative performances of the models were confirmed with the prediction of 16 Pfizer drug candidates that were not part of the original data set. Results show that ML IVIVE models are comparable to or superior to their best mechanistic counterparts. 
We also show that ML IVIVE models can be used to derive insights into factors for the improvement of mechanistic PK prediction.
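
For readers unfamiliar with the metric quoted above, the average absolute fold error is commonly defined as 10 raised to the mean absolute log10 ratio of predicted to observed values. The sketch below assumes that standard definition (the abstract itself does not spell it out); the function name `aafe` and the example values are hypothetical.

```python
import math

def aafe(observed, predicted):
    """Average absolute fold error: 10 ** mean(|log10(predicted / observed)|).

    A value of 1.0 means perfect prediction; 2.0 means predictions are
    off by a factor of two on average (in either direction).
    All values must be strictly positive.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    log_errors = [abs(math.log10(p / o)) for o, p in zip(observed, predicted)]
    return 10 ** (sum(log_errors) / len(log_errors))

# Hypothetical clearance values: one twofold over-prediction, one twofold
# under-prediction. Both count equally toward the fold error.
print(aafe([1.0, 10.0], [2.0, 5.0]))
```

Because the error is averaged on a log scale, over- and under-predictions by the same factor are penalized equally, which is why the metric suits quantities such as clearance that span orders of magnitude.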

  • Research Article
  • Citations: 5
  • 10.2118/207877-pa
Application of Machine Learning to Interpret Steady-State Drainage Relative Permeability Experiments
  • Mar 22, 2023
  • SPE Reservoir Evaluation & Engineering
  • Eric Sonny Mathew + 4 more

Summary: A meticulous interpretation of steady-state or unsteady-state relative permeability (Kr) experimental data is required to determine a complete set of Kr curves. In this work, different machine learning (ML) models were developed to assist in a faster estimation of these curves from steady-state drainage coreflooding experimental runs. These ML algorithms include gradient boosting (GB), random forest (RF), extreme gradient boosting (XGB), and deep neural network (DNN) with a main focus on and comparison of the two latter algorithms (XGB and DNN). Based on existing mathematical models, a leading-edge framework was developed where a large database of Kr and capillary pressure (Pc) curves were generated. This database was used to perform thousands of coreflood simulation runs representing oil-water drainage steady-state experiments. The results obtained from these simulation runs, mainly pressure drop along with other conventional core analysis data, were used to estimate analytical Kr curves based on Darcy’s law. These analytically estimated Kr curves along with the previously generated Pc curves were fed as features into the ML model. The entire data set was split into 80% for training and 20% for testing. The k-fold cross-validation technique was applied to increase the model’s accuracy by splitting the 80% training portion into 10 folds. In this manner, for each of the 10 experiments, nine folds were used for training and the remaining fold was used for model validation. Once the model was trained and validated, it was subjected to blind testing on the remaining 20% of the data set. The ML model learns to capture fluid flow behavior inside the core from the training data set.
In terms of applicability of these ML models, two sets of experimental data were needed as input; the first was the analytically estimated Kr curves from the steady-state drainage coreflooding experiments, while the other was the Pc curves estimated from centrifuge or mercury injection capillary pressure (MICP) measurements. The trained/tested model was then able to estimate Kr curves based on the experimental results fed as input. Furthermore, to test the performance of the ML model when only one set of experimental data is available to an end user, a recurrent neural network (RNN) algorithm was trained/tested to predict Kr curves in the absence of Pc curves as an input. The performance of the three developed models (XGB, DNN, and RNN) was assessed using the values of the coefficient of determination (R2) along with the loss calculated during training/validation of the model. The respective crossplots along with comparisons of ground truth vs. artificial intelligence (AI)-predicted curves indicated that the model is capable of making accurate predictions with an error percentage between 0.2% and 0.6% on history-matching experimental data for all three tested ML techniques. This implies that the AI-based model exhibits better efficiency and reliability in determining Kr curves when compared to conventional methods. The developed ML models by no means replace the need to conduct drainage coreflooding or centrifuge experiments but act as an alternative to existing commercial platforms that are used to interpret experimental data to predict Kr curves. The two main advantages of the developed ML models are their capability of predicting Kr curves within a matter of a few minutes as well as with limited intervention from the end user. The results also include a comparison between classical ML approaches, shallow neural networks, and DNNs in terms of accuracy in predicting the final Kr curves. 
The research presented here is an extension of the state-of-the-art framework proposed by Mathew et al. (2021). However, the two main aspects of the current study are the application of deep learning for the prediction of Kr curves and the application of feature engineering. The latter not only reduces the training/testing time for the ML models but also enables the end user to obtain the final predictions with the least set of experimental data. The various models discussed in this research work currently focus on the prediction of Kr curves for drainage steady-state experiments; however, the work can be extended to capture the imbibition cycle as well.
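
The data-partitioning scheme described in this summary (an 80/20 train/test split, with the training portion divided into 10 folds for cross-validation) can be sketched with NumPy alone. The function names `split_and_folds` and `cv_iter`, the random seed, and the sample count are hypothetical choices for illustration; the original study's tooling is not stated.

```python
import numpy as np

def split_and_folds(n_samples, test_fraction=0.2, n_folds=10, seed=0):
    """Shuffle sample indices, hold out a test set, and cut the rest into folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    test_idx = indices[:n_test]            # kept aside for blind testing
    train_idx = indices[n_test:]           # used for cross-validation
    folds = np.array_split(train_idx, n_folds)
    return train_idx, test_idx, folds

def cv_iter(folds):
    """Yield (training indices, validation indices) for each fold in turn."""
    for k, validation in enumerate(folds):
        training = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield training, validation

train_idx, test_idx, folds = split_and_folds(1000)
for round_number, (fit_idx, val_idx) in enumerate(cv_iter(folds), start=1):
    # A model would be fitted on fit_idx and scored on val_idx here.
    print(round_number, len(fit_idx), len(val_idx))
```

The held-out 20% is never touched inside the loop, so it remains a blind test of whichever model configuration the cross-validation rounds select.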

  • Research Article
  • Citations: 10
  • 10.1016/j.compag.2024.108805
A Hybrid Model that Combines Machine Learning and Mechanistic Models for Useful Grass Growth Prediction
  • Mar 9, 2024
  • Computers and Electronics in Agriculture
  • Eoin M Kenny + 3 more

Recently, Machine Learning (ML) has been heralded as a panacea for modelling problems across many domains, including Smart Agriculture (SmartAg), often in opposition to traditional mechanistic models arising from decades of scientific discovery. However, mechanistic models are often successful in “real world” problem-domains where ML models encounter difficulties (e.g., where the distribution of test data is not the same as the training data, violating the so-called independent and identically distributed (i.i.d.) assumption). In this paper, we consider a specific case of this opposition between a mechanistic model of grass growth and an ML model using historical farm measurements. In our analyses, we find that both types of model have respective strengths. The mechanistic model can often handle out-of-distribution events better than the ML model, but the ML model can often handle temporary fluctuations in event variables (e.g., changing climate factors). Hence, we propose a combined hybrid model that learns which model to use when predicting grass growth. We argue that this combined approach has several practical benefits in providing stable and accurate predictions under widely varying conditions such as never-before-seen temperature fluctuations.

  • Peer Review Report
  • Citations: 3
  • 10.7554/elife.76846.sa2
Author response: Machine learning-assisted discovery of growth decision elements by relating bacterial population dynamics to environmental diversity
  • Jun 8, 2022
  • Honoka Aida + 3 more

Microorganisms growing in their habitat constitute a complex system. How the individual constituents of the environment contribute to microbial growth remains largely unknown. The present study focused on the contribution of environmental constituents to population dynamics via a high-throughput assay and data-driven analysis of a wild-type Escherichia coli strain. A large dataset constituting a total of 12,828 bacterial growth curves with 966 medium combinations, which were composed of 44 pure chemical compounds, was acquired. Machine learning analysis of the big data relating the growth parameters to the medium combinations revealed that the decision-making components for bacterial growth were distinct among various growth phases, e.g., glucose, sulfate, and serine for maximum growth, growth rate, and growth delay, respectively. Further analyses and simulations indicated that branched-chain amino acids functioned as global coordinators for population dynamics, as well as a survival strategy of risk diversification to prevent the bacterial population from undergoing extinction.

  • Research Article
  • Citations: 6
  • 10.3390/bioengineering10111320
Mathematical and Machine Learning Models of Renal Cell Carcinoma: A Review.
  • Nov 16, 2023
  • Bioengineering
  • Dilruba Sofia + 2 more

This review explores the multifaceted landscape of renal cell carcinoma (RCC) by delving into both mechanistic and machine learning models. While machine learning models leverage patients' gene expression and clinical data through a variety of techniques to predict patients' outcomes, mechanistic models focus on investigating cells' and molecules' interactions within RCC tumors. These interactions are notably centered around immune cells, cytokines, tumor cells, and the development of lung metastases. The insights gained from both machine learning and mechanistic models encompass critical aspects such as signature gene identification, sensitive interactions in the tumors' microenvironments, metastasis development in other organs, and the assessment of survival probabilities. By reviewing the models of RCC, this study aims to shed light on opportunities for the integration of machine learning and mechanistic modeling approaches for treatment optimization and the identification of specific targets, all of which are essential for enhancing patient outcomes.

  • Research Article
  • Citations: 29
  • 10.1016/j.jhazmat.2023.133196
Machine learning-based water quality prediction using octennial in-situ Daphnia magna biological early warning system data
  • Dec 8, 2023
  • Journal of Hazardous Materials
  • Heewon Jeong + 6 more

  • Research Article
  • Citations: 73
  • 10.1016/j.memsci.2022.121131
Machine learning prediction on the fractional free volume of polymer membranes
  • Oct 27, 2022
  • Journal of Membrane Science
  • Lei Tao + 4 more

  • Research Article
  • 10.1002/agj2.21733
Rice Haun stage estimation based on mechanistic and machine learning methods
  • Nov 28, 2024
  • Agronomy Journal
  • Guoqing Lei + 8 more

Haun stage (HS), a continuous numerical phenological indicator of cereal crops, is widely used in agronomic management. However, few models have been developed to estimate HS considering the diverse environmental and agronomic influences. In this study, a dataset comprising 2350 HS observations of two rice (Oryza sativa L.) cultivars (Longjing31 and Suijing18) and variables including planting spatiotemporal information, transplanting day of year (TDOY), accumulated air temperature (AcTem), and remote‐sensing vegetation indices (VIs) were collected from 226 field plots. Two mechanistic phenology models, Streck and Phyllochron, and three machine learning (ML) models, including the generalized linear model (GLM), gradient boosting machine (GBM), and deep learning (DL), were developed to predict the HS with different combinations of inputs. The results indicate that the three ML models outperformed the two mechanistic models: even when using simple spatiotemporal data, the relative root mean square error (RRMSE) decreased by more than 0.023. The GBM and DL models exhibited similar prediction accuracy (RRMSE from 0.0336 to 0.0543), with GBM performing relatively better when VIs were included as input factors. The relative error density distributions (REDDs) of estimated HS in the three ML models were relatively spread out when using limited predictive information from spatiotemporal data and VIs, especially during the late rice growth stage and for the Suijing18 cultivar. The inclusion of crop cultivar information enhanced the consistency of REDD, and either VIs or (TDOY, AcTem) provided sufficient information for accurate HS estimation. These findings can provide valuable insights for crop phenology estimation and agronomic practices under varying environments.

  • Research Article
  • Citations: 214
  • 10.1021/acs.jcim.1c01031
Benchmarking Machine Learning Models for Polymer Informatics: An Example of Glass Transition Temperature.
  • Oct 18, 2021
  • Journal of Chemical Information and Modeling
  • Lei Tao + 2 more

In the field of polymer informatics, utilizing machine learning (ML) techniques to evaluate the glass transition temperature Tg and other properties of polymers has attracted extensive attention. This data-centric approach is much more efficient and practical than laborious experimental measurements when faced with a daunting number of polymer structures. Various ML models are demonstrated to perform well for Tg prediction. Nevertheless, they are trained on different data sets, using different structure representations, and based on different feature engineering methods. Thus, the critical question arises of how to select a proper ML model that handles Tg prediction with good generalization ability. To provide a fair comparison of different ML techniques and examine the key factors that affect the model performance, we carry out a systematic benchmark study by compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are structure representations, feature representations, and ML algorithms. In terms of polymer structure representation, we consider the polymer monomer, repeat unit, and oligomer with longer chain structure. Based on that, the feature representation is calculated, including Morgan fingerprinting with or without substructure frequency, RDKit descriptors, molecular embedding, molecular graph, etc. Afterward, the obtained feature input is trained using different ML algorithms, such as deep neural networks, convolutional neural networks, random forest, support vector machine, LASSO regression, and Gaussian process regression. We evaluate the performance of these ML models using a holdout test set and an extra unlabeled data set from high-throughput molecular dynamics simulation. The ML model's generalization ability on an unlabeled data set is a particular focus, and the model's sensitivity to topology and the molecular weight of polymers is also taken into consideration.
This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.

  • Research Article
  • Citations: 2
  • 10.17762/turcomat.v12i8.3942
Performance Enhancement of Hybrid Algorithm for Bank Telemarketing
  • Apr 20, 2021
  • Turkish Journal of Computer and Mathematics Education (TURCOMAT)
  • Rohan Desai

Telemarketing is an interactive direct marketing system in which telemarketers encourage customers to leverage resources by notifying them of, and imparting knowledge about, online products and the latest business offers via direct interaction or a telephone call. During the current global pandemic, telemarketing has become a dominant backbone for growing online banking business to offset declining retail business. It has gained prominence in the banking and financial sector with the enormous adoption and availability of cellular connections among customers. Contemporary work has found that conventional classification and data mining methods have a problem of ill-fitting with multiple features and are prone to data leakage during re-training of the machine learning model. A local Indian bank was selected, considering the current economic slowdown and crisis. Three machine learning (ML) models, Logistic Regression (LR), Naive Bayes (NB), and Decision Trees (DTs), are discussed along with the hybrid ML model. The three ML models were tested and analysed alongside the proposed hybrid ML model on an evaluation set, with the data partitioned into training, validation, and test sets. The hybrid model first identifies important features of subscribed customers and predicts the response of a potential customer, existing or new, who will eventually subscribe again through the direct marketing campaign. Through transfer learning, the hybrid model is trained to predict the response of a new customer who will subscribe to the product or service offered via a direct marketing campaign. The hybrid model API shows the new customer's response on the front-end screen. To overcome the problem of ill-fitting and data leakage, the model is trained on a large dataset and tuned on a validation set. The proposed hybrid machine learning technique gave the best results (accuracy 98.69%). The model was developed in Python.
Financial institutions and organizations can use the hybrid model for predictions of product direct marketing response with customer transaction information.

  • Research Article
  • Citations: 18
  • 10.1080/13658816.2023.2292570
Act2Loc: a synthetic trajectory generation method by combining machine learning and mechanistic models
  • Dec 12, 2023
  • International Journal of Geographical Information Science
  • Kang Liu + 5 more

Human mobility data play a crucial role in many fields such as infectious diseases, transportation, and public safety. Although the development of Information and Communication Technologies (ICTs) has made it easy to collect individual-level positioning records, raw individual trajectory data are still limited in availability and usability due to privacy issues. Developing models to generate synthetic trajectories that are statistically close to the real data is a promising solution. This study proposed a novel trajectory generation method called Act2Loc (Activity to Location), which combined machine learning and mechanistic models. First, an activity-sequence generation model was constructed based on machine learning models (i.e. K-medoids and Transformer) to generate individual activity sequences aligning with human activity patterns. Then, a spatial-location selection model was proposed based on mechanistic models (e.g. Universal Opportunity model) to explicitly determine the specific locations of the activities in each generated sequence. Experimental results showed that compared to baselines based on purely machine learning or mechanistic models, Act2Loc can better reproduce the spatio-temporal characteristics of the real data, with additional advantage of low data requirements for training, proving its potential for generating synthetic trajectories in practice. This research offers new insights on knowledge-guided GeoAI models for human mobility.

  • Research Article
  • Citations: 3
  • 10.2519/josptmethods.2024.0086
Evaluation of the Ability of Machine Learning-Models to Assess Postural Orientation Errors During a Single-Leg Squat
  • Jan 1, 2025
  • JOSPT Methods
  • Jenny Älmqvist Nae + 5 more

OBJECTIVES: To reach agreement among experts on visual assessments of postural orientation errors (POEs) during the single-leg squat (SLS), and to use expert agreement assessments as ground truth for machine learning (ML) models to evaluate their ability to classify POEs. DESIGN: Methodological study with mixed-methods design. METHODS: POEs of the lower extremity and trunk were assessed from videos and scored as good, fair, or poor. Three experts visually assessed each repetition for each POE independently and then reached agreement. ML models, one for each POE, were trained to assess POEs, using supervised learning on a subset of videos from the agreement assessment (n = 48). The remaining 99 videos were used to compare the prediction of ML models with the agreement scores (criterion validity), using quadratic weighted kappa (κ), Spearman's correlation coefficient (rs), and accuracy. RESULTS: Machine learning models for the POEs knee medial to foot position (KMFP), femur medial to shank, and femoral valgus showed strong association/substantial agreement with expert agreement scores (rs = 0.566-0.702, κ = 0.58-0.7). Machine learning models for the POEs pelvis and trunk showed moderate association/fair agreement with expert agreement scores (κ = 0.28-0.4, rs = 0.324-0.432), and the POE foot pronation showed no association/agreement (κ = −0.042, rs = −0.05). ML models predicted the expert agreement score in 53% to 78% of the cases. CONCLUSION: Using ML models as a fast and comprehensive assessment of POEs during the SLS shows promising results, with the ML models for the POEs KMFP, femur medial to shank, and femoral valgus indicating good validity. Training on larger datasets and/or modifications to some ML models may lead to improvements in model performance. JOSPT Methods 2025;1(1):17-29. Epub 25 November 2024. doi:10.2519/josptmethods.2024.0086
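
The quadratic weighted kappa used above as an agreement measure can be computed directly from a confusion matrix. The sketch below follows the usual textbook definition; the function name `quadratic_weighted_kappa`, the integer coding of the ordinal scores (assumed here as 0 = good, 1 = fair, 2 = poor), and the example ratings are hypothetical and not taken from the study.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels.

    Labels must be integers in 0..n_categories-1. Returns 1.0 for perfect
    agreement and about 0.0 for chance-level agreement.
    """
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)

    # Observed confusion matrix: rows = rater A, columns = rater B.
    observed = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        observed[i, j] += 1

    # Expected counts if the two raters were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)

    # Quadratic weights: larger ordinal distance is penalized more heavily.
    idx = np.arange(n_categories)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical expert and model scores (0 = good, 1 = fair, 2 = poor).
expert = [0, 1, 2, 1, 0, 2, 1, 1]
model = [0, 1, 1, 1, 0, 2, 2, 1]
print(quadratic_weighted_kappa(expert, model, n_categories=3))
```

Unlike plain accuracy, this measure treats a good/poor confusion as worse than a good/fair confusion, which suits ordinal scores of this kind.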

  • Research Article
  • Citations: 12
  • 10.1016/j.jobe.2024.108836
Hybrid models of machine-learning and mechanistic models for indoor particulate matter concentration prediction
  • Feb 17, 2024
  • Journal of Building Engineering
  • Jihoon Kim + 2 more
