Avoiding big data pitfalls.
Clinical decisions are based on a combination of inductive inference built on experience (ie, statistical models) and on deductions provided by our understanding of the workings of the cardiovascular system (ie, mechanistic models). In a similar way, computers can be used to discover new hidden patterns in the (big) data and to make predictions based on our knowledge of physiology or physics. Surprisingly, unlike humans throughout history, computers seldom combine inductive and deductive processes. An explosion of expectations surrounds the computer's inductive method, fueled by "big data" and popular trends. This article reviews the risks and potential pitfalls of this computer approach, where the lack of generality, selection or confounding biases, overfitting, or spurious correlations are among the commonplace flaws. Recommendations to reduce these risks include an examination of data through the lens of causality, the careful choice and description of statistical techniques, and an open research culture with transparency. Finally, the synergy between mechanistic and statistical models (ie, the digital twin) is discussed as a promising pathway toward precision cardiology that mimics the human experience.
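As a concrete illustration of one flaw this article names, the sketch below shows how screening many candidate predictors against a small sample produces strong correlations from pure noise; the sample sizes and variable names are arbitrary assumptions, not taken from the article.

```python
# Spurious correlation from mass screening: noise "biomarkers" vs. a noise outcome.
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_features = 50, 2000                  # few samples, many candidate predictors
X = rng.normal(size=(n_patients, n_features))      # random "biomarkers" with no real signal
y = rng.normal(size=n_patients)                    # random "outcome"

# Pearson correlation of every candidate predictor with the outcome
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
r = Xc.T @ yc / n_patients

print(f"strongest correlation found by chance alone: |r| = {np.abs(r).max():.2f}")
# Naive screening would report this as a "finding"; examining the data through the
# lens of causality is one of the safeguards the article recommends.
```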
- Research Article
- 10.1016/j.jmb.2025.169181
- Sep 1, 2025
- Journal of Molecular Biology
Artificial-intelligence-driven Innovations in Mechanistic Computational Modeling and Digital Twins for Biomedical Applications.
- Research Article
- 10.1111/ejss.13011
- Jul 9, 2020
- European Journal of Soil Science
Digital soil mapping (DSM) is an effective mapping technique that supports the increased need for quantitative soil data. In DSM, soil properties are correlated with environmental characteristics using statistical models such as regression. However, many of these relationships are explicitly described in mechanistic simulation models. Therefore, the mechanistic relationships can, in theory, replace the statistical relationships in DSM. This study aims to develop a mechanistic model to predict soil organic matter (SOM) stocks in Natura2000 areas of the Cantabria region (Spain). The mechanistic model is established in four steps: (a) identify major processes that influence SOM stocks, (b) review existing models describing the major processes and the respective environmental data that they require, (c) establish a database with the required input data, and (d) calibrate the model with field observations. The SOM stocks map resulting from the mechanistic model had a mean error (ME) of −2 t SOM ha⁻¹ and a root mean square error (RMSE) of 66 t SOM ha⁻¹. Lin's concordance correlation coefficient was 0.47 and the amount of variance explained (AVE) was 0.21. The results of the mechanistic model were compared to the results of a statistical model; the correlation coefficient between the two SOM stock maps was 0.8. This study illustrated that mechanistic soil models can be used for DSM, which brings new opportunities. Mechanistic models for DSM should be considered for mapping soil characteristics that are difficult to predict by statistical models, and for extrapolation purposes.
Highlights:
- Theoretically, mechanistic models can replace the statistical relationships in digital soil mapping.
- Mechanistic soil models were used to develop a mechanistic model for digital soil mapping that predicted SOM stocks.
- The applicability of the mechanistic approach needs to be explored for different soil properties and regions.
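For reference, the validation statistics quoted in this abstract (ME, RMSE, Lin's concordance correlation coefficient, AVE) can be computed as in the sketch below; the AVE definition used here (1 − SSE/SST) and the observed/predicted values are assumptions for illustration, not data from the study.

```python
# Sketch of the map-validation metrics reported above, for hypothetical
# observed vs. predicted SOM stocks (t SOM ha^-1).
import numpy as np

def validation_metrics(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = pred - obs
    me = err.mean()                                   # mean error (bias)
    rmse = np.sqrt((err ** 2).mean())                 # root mean square error
    # Lin's concordance correlation coefficient
    ccc = (2 * np.cov(obs, pred, bias=True)[0, 1]
           / (obs.var() + pred.var() + (obs.mean() - pred.mean()) ** 2))
    ave = 1 - (err ** 2).sum() / ((obs - obs.mean()) ** 2).sum()  # assumed AVE definition
    return me, rmse, ccc, ave

obs = np.array([120.0, 210.0, 95.0, 310.0, 180.0])    # illustrative field observations
pred = np.array([130.0, 190.0, 110.0, 280.0, 200.0])  # illustrative model predictions
print("ME=%.1f  RMSE=%.1f  CCC=%.2f  AVE=%.2f" % validation_metrics(obs, pred))
```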
- Conference Article
- 10.69997/sct.122855
- Jul 1, 2025
A Digital Twin (DT) is a purposeful digital representation of a physical entity that employs data, algorithms, and software to enhance operations, making it possible, for example, to forecast failures or evaluate new designs through the simulation of real-world scenarios. DTs are enablers for real-time monitoring, simulation, and optimization. However, traditional simulation DTs often rely on complex, non-linear mechanistic models with high computational demands, complex structures, and a large number of specific parameters, and thus pose quite a challenge to maintainability. Surrogate models, on the other hand, are simplified approximations of more complex, higher-order models. These approximations are typically built using data-driven approaches, such as Random Forest Regression, facilitating faster simulations, simpler adaptation, and quicker deployment. This study analyzes the complexity of mechanistic and surrogate modeling approaches in the context of DTs to aid model selection. A model with reduced complexity enhances computational efficiency, simplifies implementation, and supports real-time monitoring and predictive maintenance. Complexity analysis evaluates metrics such as analytical, structural, space, behavioral, training, and prediction complexity, resulting in an overall complexity score for model selection. However, the decision involves trade-offs, such as balancing high fidelity with low complexity or prioritizing high explainability over structural simplicity. Addressing these trade-offs is essential in selecting a model that balances the accuracy, usability, and efficiency of DTs. Using a stirred tank reactor as a use case, the mechanistic model is compared to a surrogate model to quantify complexity scores and select a less complex model for DT development.
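As a rough illustration of the surrogate idea compared in this study, the sketch below fits a Random Forest surrogate to the output of a toy mechanistic reactor relation (first-order reaction in an ideal CSTR with Arrhenius kinetics); the equations, parameter values, and operating ranges are invented for illustration and are not the study's stirred tank model.

```python
# Data-driven surrogate of a (toy) mechanistic steady-state reactor model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

R = 8.314  # universal gas constant, J/(mol K)

def cstr_outlet_conc(T, tau, c_in=1.0, k0=1e6, Ea=60e3):
    """Mechanistic model: first-order reaction in an ideally mixed CSTR."""
    k = k0 * np.exp(-Ea / (R * T))       # Arrhenius rate constant
    return c_in / (1.0 + k * tau)        # steady-state outlet concentration

rng = np.random.default_rng(1)
T = rng.uniform(300, 400, 500)           # temperature [K]
tau = rng.uniform(10, 300, 500)          # residence time [s]
X = np.column_stack([T, tau])
y = cstr_outlet_conc(T, tau)             # "simulation runs" of the mechanistic model

surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("surrogate prediction :", surrogate.predict([[350.0, 120.0]])[0])
print("mechanistic model    :", cstr_outlet_conc(350.0, 120.0))
```

The trade-off the study quantifies is visible even here: the surrogate is cheaper to evaluate and adapt, but it is only as trustworthy as the simulations or data used to train it.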
- Supplementary Content
- 10.1007/s00399-024-01014-0
- Jan 1, 2024
- Herzschrittmachertherapie & Elektrophysiologie
Cardiac arrhythmias remain a major cause of death and disability. Current antiarrhythmic therapies are effective to only a limited extent, likely in large part due to their mechanism-independent approach. Precision cardiology aims to deliver targeted therapy for an individual patient to maximize efficacy and minimize adverse effects. In-silico digital twins have emerged as a promising strategy to realize the vision of precision cardiology. While there is no uniform definition of a digital twin, it typically employs digital tools, including simulations of mechanistic computer models, based on patient-specific clinical data to understand arrhythmia mechanisms and/or make clinically relevant predictions. Digital twins have become part of routine clinical practice in the setting of interventional cardiology, where commercially available services use digital twins to non-invasively determine the severity of stenosis (computed tomography-based fractional flow reserve). Although routine clinical application has not been achieved for cardiac arrhythmia management, significant progress towards digital twins for cardiac electrophysiology has been made in recent years. At the same time, significant technical and clinical challenges remain. This article provides a short overview of the history of digital twins for cardiac electrophysiology, including recent applications for the prediction of sudden cardiac death risk and the tailoring of rhythm control in atrial fibrillation. The authors highlight the current challenges for routine clinical application and discuss how overcoming these challenges may allow digital twins to enable a significant precision medicine-based advancement in cardiac arrhythmia management.
- Single Report
- 10.2172/1881930
- Sep 12, 2022
Since each cancer has its own unique characteristics, each one can respond differently to the same treatments. Therefore, the creation of a digital twin (DT) of cancer can assist us in predicting the evolution of an individual's cancer through modeling each tumor's characteristics and response to treatment. Hence, we propose to take advantage of recent advances in computational approaches and combine mechanistic, machine learning, and stochastic modeling approaches to create “My Virtual Cancer”, a DT platform. To establish a personalized DT, we use patient-specific data for parameter estimation, sensitivity analysis, and uncertainty quantification. For each patient, we will estimate the values of the parameters of their QSP model using the patient's data. We perform a multi-dimensional sensitivity analysis and uncertainty quantification on the mechanistic model to find a set of critical interactions and predict confidence intervals. Since this QSP model includes the data-driven mechanistic model of cell and molecule interaction networks, one of the ultimate results of this DT would be the prediction of tumor evolution.
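The QSP model itself is not reproduced in this abstract; the sketch below only illustrates the general calibration-and-sensitivity step it describes, using a toy logistic tumor growth model and invented patient observations.

```python
# Patient-specific calibration of a toy growth model plus one-at-a-time sensitivity.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def tumor_volume(t, r, K, v0=0.1):
    """Logistic growth stand-in for a QSP model: dV/dt = r V (1 - V/K)."""
    return odeint(lambda v, _t: r * v * (1 - v / K), v0, t).ravel()

t_obs = np.array([0.0, 7.0, 14.0, 21.0, 28.0])       # days since baseline scan
v_obs = np.array([0.10, 0.24, 0.48, 0.80, 1.05])     # hypothetical volumes (cm^3)

(r_hat, K_hat), _ = curve_fit(tumor_volume, t_obs, v_obs,
                              p0=[0.1, 2.0], bounds=([0.01, 0.2], [1.0, 10.0]))
print(f"estimated r = {r_hat:.3f}/day, K = {K_hat:.2f} cm^3")

# Local sensitivity: change in predicted day-28 volume for a +10% parameter change
base = tumor_volume(t_obs, r_hat, K_hat)[-1]
for name, args in [("r", (1.1 * r_hat, K_hat)), ("K", (r_hat, 1.1 * K_hat))]:
    change = tumor_volume(t_obs, *args)[-1] / base - 1
    print(f"+10% in {name}: day-28 volume changes by {100 * change:+.1f}%")
```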
- Book Chapter
- 10.1007/978-3-030-78307-5_14
- Jan 1, 2022
This chapter presents a Digital Twin Pipeline Framework of the COGNITWIN project that supports Hybrid and Cognitive Digital Twins, through four Big Data and AI pipeline steps adapted for Digital Twins. The pipeline steps are Data Acquisition, Data Representation, AI/Machine Learning, and Visualisation and Control. Big Data and AI technology selections of the Digital Twin system are related to the different technology areas in the BDV Reference Model. A Hybrid Digital Twin is defined as a combination of a data-driven Digital Twin with first-order physical models. The chapter illustrates the use of a Hybrid Digital Twin approach by describing an application example of spiral-welded steel industrial machinery maintenance, with a focus on the Digital Twin support for Predictive Maintenance. A further extension, currently in progress, supports Cognitive Digital Twins by adding capabilities for learning, understanding, and planning, including the use of domain and human knowledge. By using digital, hybrid, and cognitive twins, the project's presented pilot aims to reduce energy consumption and the average duration of machine downtimes. Data-driven artificial intelligence methods and predictive analytics models that are deployed in the Digital Twin pipeline are detailed with a focus on decreasing the machinery's unplanned downtime. We conclude that the presented pipeline can be used for similar cases in the process industry.
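As a loose illustration of the hybrid idea defined above, the sketch below combines a first-order physical baseline (Newtonian cooling) with a data-driven model of its residual; the cooling law, the extra operating variable, and the data are assumptions for illustration, not the COGNITWIN pilot's models.

```python
# Hybrid twin sketch: physics-based baseline + data-driven residual correction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def physical_baseline(t, T0=900.0, T_env=25.0, k=0.01):
    """First-order cooling model: T(t) = T_env + (T0 - T_env) * exp(-k t)."""
    return T_env + (T0 - T_env) * np.exp(-k * t)

rng = np.random.default_rng(2)
t = rng.uniform(0, 600, 400)                 # seconds since start of the process step
load = rng.uniform(0.2, 1.0, 400)            # operating condition the physics ignores
T_measured = physical_baseline(t) + 40 * load + rng.normal(0, 2, 400)

# Data-driven part: learn what the physics misses (the residual)
residual_model = GradientBoostingRegressor(random_state=0).fit(
    np.column_stack([t, load]), T_measured - physical_baseline(t))

def hybrid_predict(t, load):
    return physical_baseline(t) + residual_model.predict(np.column_stack([t, load]))

print(hybrid_predict(np.array([300.0]), np.array([0.8])))
```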
- Research Article
- 10.1016/j.ijme.2013.05.001
- Jun 17, 2013
- The International Journal of Management Education
From theory to practice: Teaching management using films through deductive and inductive processes
- Supplementary Content
- 10.1007/s40471-016-0078-4
- Jan 1, 2016
- Current Epidemiology Reports
The dynamics of infectious disease epidemics are driven by interactions between individuals with differing disease status (e.g., susceptible, infected, immune). Mechanistic models that capture the dynamics of such “dependent happenings” are a fundamental tool of infectious disease epidemiology. Recent methodological advances combined with access to new data sources and computational power have resulted in an explosion in the use of dynamic models in the analysis of emerging and established infectious diseases. Increasing use of models to inform practical public health decision making has challenged the field to develop new methods to exploit available data and appropriately characterize the uncertainty in the results. Here, we discuss recent advances and areas of active research in the mechanistic and dynamic modeling of infectious disease. We highlight how a growing emphasis on data and inference, novel forecasting methods, and increasing access to “big data” are changing the field of infectious disease dynamics. We showcase the application of these methods in phylodynamic research, which combines mechanistic models with rich sources of molecular data to tie genetic data to population-level disease dynamics. As dynamic and mechanistic modeling methods mature and are increasingly tied to principled statistical approaches, the historic separation between infectious disease dynamics and “traditional” epidemiologic methods is beginning to erode; this presents new opportunities for cross-pollination between fields and novel applications.
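The classic SIR model is perhaps the most compact example of the “dependent happenings” idea: each individual's infection risk depends on how many others are currently infectious. The sketch below uses illustrative parameter values not tied to any disease discussed in the review.

```python
# Minimal SIR model: transmission couples susceptible and infectious compartments.
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    s, i, r = y
    new_infections = beta * s * i          # risk to S depends on current prevalence I
    return [-new_infections, new_infections - gamma * i, gamma * i]

t = np.linspace(0, 120, 121)               # days
beta, gamma = 0.3, 0.1                     # illustrative values, R0 = beta/gamma = 3
s, i, r = odeint(sir, [0.999, 0.001, 0.0], t, args=(beta, gamma)).T
print(f"epidemic peak: {100 * i.max():.1f}% infectious on day {t[i.argmax()]:.0f}")
```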
- Preprint Article
- 10.26434/chemrxiv-2025-r70bs-v2
- Mar 27, 2025
Deriving versatile and robust mechanistic models from experimental data is a key challenge in engineering and natural sciences. This is especially true in chemical reaction engineering, where reactor manufacturers and operators increasingly pursue the development and maintenance of digital twins that rely on frequent model updates and ask for automation of this modelling process. In this work, we propose an automated workflow that generates accurate mechanistic reactor models from experimental concentration data of a given reactor. At the core of this workflow, a reinforcement learning agent assembles an interpretable reactor model by iteratively simplifying general differential balance equations and fitting the resulting candidate model to experimental data. We demonstrate the performance of our workflow in two case studies. An in silico case study shows that the workflow correctly reconstructs the model underlying a synthetic data set, is robust against noise in the input data, and has favourable scaling properties. The agent accelerates the model derivation process significantly compared to an exhaustive enumerative search. Secondly, an experimental case study is conducted employing a Taylor-Couette prototype reactor. A liquid-phase esterification reaction of (2-bromophenyl)methanol and acetic anhydride was used as a test system. Based on the experimental data, the workflow derives meaningful mechanistic models, with the most accurate model showing a normalized root mean squared error of 2.4%. Future work encompasses the integration of automated experiments into the workflow and the transfer of our workflow to process units beyond chemical reactors.
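The innermost step of such a workflow, fitting a candidate rate law to concentration data and scoring it, can be sketched as below; the candidate mechanism, the data, and the range-normalised NRMSE are illustrative assumptions, not the paper's agent, reaction system, or exact error definition.

```python
# Fit one candidate mechanism to concentration data and score it with an NRMSE.
import numpy as np
from scipy.optimize import curve_fit

def candidate_model(t, c0, k):
    """Candidate mechanism: dC/dt = -k C  =>  C(t) = c0 * exp(-k t)."""
    return c0 * np.exp(-k * t)

t = np.array([0.0, 5.0, 10.0, 20.0, 40.0, 60.0])      # min
c = np.array([1.00, 0.78, 0.62, 0.37, 0.14, 0.06])    # mol/L, hypothetical measurements

(c0_hat, k_hat), _ = curve_fit(candidate_model, t, c, p0=[1.0, 0.05])
resid = c - candidate_model(t, c0_hat, k_hat)
nrmse = np.sqrt(np.mean(resid ** 2)) / (c.max() - c.min())   # normalised by data range
print(f"k = {k_hat:.3f} 1/min, NRMSE = {100 * nrmse:.1f}%")
```

An automated workflow of the kind described would generate, fit, and score many such candidates, with the reinforcement learning agent deciding which simplification of the balance equations to try next.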
- Research Article
- 10.1016/j.resconrec.2019.06.002
- Aug 8, 2019
- Resources, Conservation and Recycling
Digital twins probe into food cooling and biochemical quality changes for reducing losses in refrigerated supply chains
- Research Article
- 10.1080/00405000108659564
- Jan 1, 2001
- The Journal of The Textile Institute
Prediction of yarn properties from fibre properties and process parameters is a well-researched topic. For a number of years, mechanistic and statistical models have primarily been used to tackle the problem. Over the last ten years, neural networks have been used in increasing numbers for this purpose. However, a comparative assessment of the performance of these three approaches has not been forthcoming. In this paper, all three models have been applied to the data available to validate the mechanistic model described by Frydrych (pertaining to cotton yarns). The exercise was repeated for data pertaining to yarns spun from polyester staple fibre in the laboratory. The results conclusively prove the superiority of neural networks over mechanistic models and simple regression equations for predicting ring yarn tenacity from fibre properties and process parameters.
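A reduced version of the comparison described above, using two of the three approaches (multiple linear regression vs. a small neural network) on synthetic fibre data, is sketched below; the features, the assumed tenacity relation (with a peak near an optimum twist), and all numbers are illustrative, not the paper's dataset.

```python
# Regression vs. neural network for (synthetic) ring yarn tenacity prediction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 400
fibre_strength = rng.uniform(20, 35, n)     # cN/tex
fibre_length = rng.uniform(22, 32, n)       # mm
twist_factor = rng.uniform(3.2, 4.6, n)
X = np.column_stack([fibre_strength, fibre_length, twist_factor])
# hypothetical relation: tenacity peaks near an optimum twist factor of ~3.9
y = (0.4 * fibre_strength + 0.2 * fibre_length
     - 6.0 * (twist_factor - 3.9) ** 2 + rng.normal(0, 0.5, n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
linear = LinearRegression().fit(X_tr, y_tr)
neural = make_pipeline(StandardScaler(),
                       MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000,
                                    random_state=0)).fit(X_tr, y_tr)
print(f"R^2  regression: {linear.score(X_te, y_te):.2f}   "
      f"neural network: {neural.score(X_te, y_te):.2f}")
```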
- Research Article
- 10.1038/s41746-024-01188-4
- Jul 16, 2024
- npj Digital Medicine
Virtual patients and digital patients/twins are two similar concepts gaining increasing attention in health care, with the goals of accelerating drug development and improving patients’ survival, but each with its own limitations. Although methods have been proposed to generate virtual patient populations using mechanistic models, there are only a limited number of applications in immuno-oncology research. Furthermore, due to the stricter requirements of digital twins, they are often generated in a study-specific manner with models customized to particular clinical settings (e.g., treatment, cancer, and data types). Here, we discuss the challenges for virtual patient generation in immuno-oncology with our most recent experiences, initiatives to develop digital twins, and how research on these two concepts can inform each other.
- Research Article
- 10.1093/comnet/cnz024
- Aug 2, 2019
- Journal of Complex Networks
Network models are applied across many domains where data can be represented as a network. Two prominent paradigms for modelling networks are statistical models (probabilistic models for the observed network) and mechanistic models (models for network growth and/or evolution). Mechanistic models are better suited for incorporating domain knowledge, to study effects of interventions (such as changes to specific mechanisms) and to forward simulate, but they typically have intractable likelihoods. As such, and in stark contrast to statistical models, there is a relative dearth of research on model selection for such models despite the otherwise large body of extant work. In this article, we propose a simulator-based procedure for mechanistic network model selection that borrows aspects from Approximate Bayesian Computation, along with a means to quantify the uncertainty in the selected model. To select the most suitable network model, we consider and assess the performance of several learning algorithms, most notably the so-called Super Learner, which makes our framework less sensitive to the choice of a particular learning algorithm. Our approach takes advantage of the ease of forward simulation from mechanistic network models to circumvent their intractable likelihoods. The overall process is flexible and widely applicable. Our simulation results demonstrate the approach's ability to accurately discriminate between competing mechanistic models. Finally, we showcase our approach with a protein-protein interaction network model from the literature for yeast (Saccharomyces cerevisiae).
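A stripped-down version of the simulator-based selection idea might look like the sketch below: forward-simulate from two candidate mechanisms, summarise each simulated network with a few statistics, train a classifier on the summaries, and read off rough model probabilities for an observed network. The candidate models, summary statistics, and the random forest (standing in for the Super Learner) are illustrative choices, not the paper's.

```python
# Simulator-based selection between two toy mechanistic network models.
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

def summarise(g):
    degs = [d for _, d in g.degree()]
    return [np.mean(degs), np.var(degs),
            nx.average_clustering(g), nx.degree_assortativity_coefficient(g)]

X, y = [], []
for seed in range(200):
    g_pa = nx.barabasi_albert_graph(100, 2, seed=seed)       # preferential attachment
    g_rnd = nx.gnp_random_graph(100, 0.04, seed=seed)        # uniformly random attachment
    X += [summarise(g_pa), summarise(g_rnd)]
    y += ["preferential attachment", "random attachment"]

clf = RandomForestClassifier(random_state=0).fit(X, y)

observed = nx.barabasi_albert_graph(100, 2, seed=10_000)     # pretend this is the data
probs = clf.predict_proba([summarise(observed)])[0]
print(dict(zip(clf.classes_, probs.round(2))))               # rough model "probabilities"
```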
- Research Article
- 10.3897/biss.7.112373
- Sep 11, 2023
- Biodiversity Information Science and Standards
The Biodiversity Digital Twin (BioDT) project (2022-2025) aims to create prototypes that integrate various data sets, models, and expert domain knowledge enabling prediction capabilities and decision-making support for critical issues in biodiversity dynamics. While digital twin concepts have been applied in industries for continuous monitoring of physical phenomena, their application in biodiversity and environmental sciences presents novel challenges (Bauer et al. 2021, de Koning et al. 2023). In addition, successfully developing digital twins for biodiversity requires addressing interoperability challenges in data standards. BioDT is developing prototype digital twins based on use cases that span various data complexities, from point occurrence data to bioacoustics, covering nationwide forest states to specific communities and individual species. The project relies on FAIR principles (Findable, Accessible, Interoperable, and Reusable) and FAIR enabling resources like standards and vocabularies (Schultes et al. 2020) to enable the exchange, sharing, and reuse of biodiversity information, fostering collaboration among participating research infrastructures (DiSSCo, eLTER, GBIF, and LifeWatch) and data providers. It also involves creating a harmonised abstraction layer using Persistent Identifiers (PID) and FAIR Digital Object (FDO) records, alongside semantic mapping and crosswalk techniques to provide machine-actionable metadata (Schultes and Wittenburg 2019, Schwardmann 2020). Governance and engagement with research infrastructure stakeholders play crucial roles in this regard, with a focus on aligning technical and data standards discussions. In addition to data, models and workflows are key elements in BioDT. Models in the BioDT context are formal representations of problems or processes, implemented through equations, algorithms, or a combination of both, which can be executed by machine entities. The current twin prototypes are considering both statistical and mechanistic models, introducing significant variations in (1) data requirements, (2) modelling approaches and philosophy, and (3) model output. The BioDT consortium will develop guidelines and protocols for how to describe these models, what metadata to include, and how they will interact with the diverse datasets. While discussions on this topic exist within the broader context of biodiversity and ecological sciences (Jeltsch et al. 2013, Fer et al. 2020), the BioDT project is strongly committed to finding a solution within its scope. In the twinning context, data and models need to be executed within a computing infrastructure and also need to adhere to FAIR principles. Software within BioDT includes a suite of tools that facilitate data acquisition, storage, processing, and analysis. While some of these tools already exist, the challenge lies in integrating them within the digital twinning framework. One approach to achieving integration is through workflow representation, encompassing standardised procedures and protocols that guide the acquisition, packaging, processing, and analysis of data. The project is exploring Research Object Crate (RO-Crate) implementation for this (Soiland-Reyes et al. 2022). Implementing workflows can ensure reproducibility, scalability, and transparency in research practices, enabling scientists to validate and replicate findings. The BioDT project offers a novel and transformative approach to biodiversity research and application. 
By leveraging collaborative research infrastructures and adhering to data standards, BioDT aims to harness the power of data, software, supercomputers, models, and expertise to provide new insights. The foundation provided by the data standards, including those of Biodiversity Information Standards (TDWG), is crucial in realising the full potential of digital twins, facilitating the seamless integration of diverse data sources and combinations with models.
- Research Article
- Aug 28, 2023
- arXiv
Despite the remarkable advances in cancer diagnosis, treatment, and management that have occurred over the past decade, malignant tumors remain a major public health problem. Further progress in combating cancer may be enabled by personalizing the delivery of therapies according to the predicted response for each individual patient. The design of personalized therapies requires patient-specific information integrated into an appropriate mathematical model of tumor response. A fundamental barrier to realizing this paradigm is the current lack of a rigorous, yet practical, mathematical theory of tumor initiation, development, invasion, and response to therapy. In this review, we begin by providing an overview of different approaches to modeling tumor growth and treatment, including mechanistic as well as data-driven models based on “big data” and artificial intelligence. Next, we present illustrative examples of mathematical models that demonstrate their utility, and discuss the limitations of stand-alone mechanistic and data-driven models. We further discuss the potential of mechanistic models for not only predicting, but also optimizing, response to therapy on a patient-specific basis. We then discuss current efforts and future possibilities to integrate mechanistic and data-driven models. We conclude by proposing five fundamental challenges that must be addressed to fully realize personalized care for cancer patients driven by computational models.