Machine learning approach to synthetic data generation: Uncertainty generative model with neural attention

Abstract

Data scarcity undermines the precision of empirical and analytical research by limiting sample sizes and reducing statistical power. In domains such as business operations, financial management, and information systems, failure data often arise from rare events, introducing substantial aleatoric and epistemic uncertainty. Existing synthetic data generation methods, including interpolation-based oversampling and generative models, face persistent challenges. They often fail to capture rare events, preserve temporal dependencies, or model multiple sources of uncertainty, leading to unrealistic samples and degraded performance in downstream tasks. This study introduces the uncertainty generative model with neural attention (UGMNA), a synthetic data generation approach that integrates attentive neural processes, the Heston stochastic volatility model, and stochastic differential equations within a continuous-time latent framework. UGMNA addresses data scarcity by generating synthetic samples that emulate the distributional characteristics of original datasets while explicitly modeling both aleatoric and epistemic uncertainty. Its design enhances statistical power by augmenting limited datasets and ensures that synthetic data reflect key patterns, temporal dynamics, and complex distributions encountered in real-world scenarios. Experimental results across multiple case studies demonstrate that UGMNA reduces both types of uncertainty while preserving essential data patterns. Compared with conventional baselines and state-of-the-art generators, UGMNA consistently improves predictive accuracy, ranking performance, and model calibration in data-scarce, high-variance environments. These findings establish UGMNA as a robust framework for generating reliable synthetic data, offering practical utility for research and decision-making in contexts where data scarcity and uncertainty hinder model development.
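The abstract does not spell out UGMNA's equations, but its stochastic-volatility component builds on the Heston model. As a rough, illustrative sketch of how synthetic continuous-time paths can be drawn from that model, the Python snippet below uses a full-truncation Euler-Maruyama scheme; the parameter values and function name are assumptions for demonstration, not details taken from the paper.

```python
import numpy as np

def simulate_heston_paths(s0=1.0, v0=0.04, mu=0.05, kappa=1.5, theta=0.04,
                          xi=0.3, rho=-0.7, T=1.0, n_steps=252, n_paths=1000,
                          seed=0):
    """Euler-Maruyama simulation of the Heston stochastic volatility model.

    dS_t = mu * S_t dt + sqrt(v_t) * S_t dW1_t
    dv_t = kappa * (theta - v_t) dt + xi * sqrt(v_t) dW2_t,  corr(dW1, dW2) = rho
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s = np.full(n_paths, s0)
    v = np.full(n_paths, v0)
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = s0
    for t in range(1, n_steps + 1):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)
        v_pos = np.maximum(v, 0.0)  # full truncation keeps the variance non-negative
        s = s * np.exp((mu - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1)
        v = v + kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z2
        paths[:, t] = s
    return paths

synthetic_paths = simulate_heston_paths()  # shape (1000, 253)
```

A full UGMNA-style generator would additionally condition such dynamics on an attentive neural process and separate aleatoric from epistemic uncertainty; the snippet only illustrates the stochastic-volatility building block.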

Similar Papers
  • Research Article
  • Citations: 1
  • 10.1145/3688393
Experience: A Comparative Analysis of Multivariate Time-Series Generative Models: A Case Study on Human Activity Data
  • Sep 30, 2024
  • Journal of Data and Information Quality
  • Naif Alzahrani + 2 more

Human activity recognition (HAR) is an active research field that has seen great success in recent years due to advances in sensory data collection methods and activity recognition systems. Deep artificial intelligence (AI) models have contributed to the success of HAR systems lately, although still suffering from limitations such as data scarcity, the high costs of labelling data instances, and datasets' imbalance and bias. The temporal nature of human activity data, represented as time series data, imposes an additional challenge to using AI models in HAR, because most state-of-the-art models do not account for the time component of the data instances. These limitations have inspired the time-series research community to design generative models for sequential data, but very little work has been done to evaluate the quality of such models. In this work, we conduct a comparative quality analysis of three generative models for time-series data, using a case study in which we aim to generate sensory human activity data from a seed public dataset. Additionally, we adapt and clearly explain four evaluation methods of synthetic time-series data from the literature and apply them to assess the quality of the synthetic activity data we generate. We show experimentally that high-quality human activity data can be generated using deep generative models, and the synthetic data can thus be used in HAR systems to augment real activity data. We also demonstrate that the chosen evaluation methods effectively ensure that the generated data meets the essential quality benchmarks of realism, diversity, coherence, and utility. Our findings suggest that using deep generative models to produce synthetic human activity data can potentially address challenges related to data scarcity, biases, and expensive labeling. This holds promise for enhancing the efficiency and reliability of HAR systems.
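One common way to check the realism of synthetic time-series data, in the spirit of the evaluation methods discussed above, is a discriminative score: train a classifier to distinguish real from synthetic windows, where accuracy near chance suggests realistic samples. The sketch below is a generic illustration of that idea using scikit-learn; the window shapes, model choice, and placeholder data are assumptions, not the paper's actual protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def discriminative_score(real, synthetic, seed=0):
    """Accuracy of a real-vs-synthetic classifier; values near 0.5 suggest realistic samples.

    real, synthetic: arrays of shape (n_windows, window_len, n_channels).
    """
    X = np.concatenate([real, synthetic]).reshape(len(real) + len(synthetic), -1)
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Illustrative usage: 500 windows of 128 samples x 3 sensor channels (random stand-ins).
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 128, 3))
synthetic = rng.standard_normal((500, 128, 3))
print(discriminative_score(real, synthetic))  # ~0.5 here because both sets are noise
```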

  • Research Article
  • Citations: 2
  • 10.1155/2022/7908112
Design of a Financial Accounting Management System Based on a Computer Network
  • Jul 29, 2022
  • Wireless Communications and Mobile Computing
  • Xuanying Zhu

Financial management is one of the core tasks of any organization and an important link in the survival of enterprises and the economic development of the country. Therefore, financial management is regarded as a top priority, from the national tax and financial system down to enterprises and individuals. However, with the rapid development of the social economy and the extremely active economic activities in China, especially the complicated capital flows and the rapid changes in the market, even if the financial system is upgraded many times, it is still difficult to meet the requirements of enterprises. In recent years, computer network technology has been more and more widely used in all aspects of social production and life. It can take over complex, tedious, repetitive, and time-consuming work. Therefore, the research and development of a financial accounting management system based on computer network technology has important practical significance and social value. Firstly, this paper summarizes the development process and current situation of the financial management system, points out the difficulties existing in current financial accounting management, and investigates the development and application status of computer network technology. Then, it studies the shortcomings of current financial work and the necessity and feasibility of applying computer technology to financial management. Finally, it analyzes the beneficial effects of computer network technology in financial management, designs the overall structure of the financial accounting management system based on computer network technology, studies the characteristics of each subsystem of the structure, points out the technical and institutional challenges faced by computer network technology, and gives corresponding suggestions. This paper is applied research on the use of advanced computer network technology in financial accounting management, which supports the rapid development of China's financial accounting management systems and provides a reference approach for the application of computer network technology in other traditional industries.

  • Research Article
  • 10.15611/eada.2024.2.01
Synthetic Financial Data: A Case Study Regarding Polish Limited Liability Companies Data
  • Jan 1, 2024
  • Econometrics
  • Aleksandra Szymura

Aim: The aim of this article was to present and evaluate the concept of synthetic data. They are completely new, artificially generated data, but keep the statistical properties of real data. Due to the statistical similarity with real data, they can be used instead of them. This action allows data to be shared externally while guaranteeing their privacy. Methodology: New datasets were generated based on financial information about Polish limited liability companies, which come from the Orbis database and refer to 2020. To create synthetic data, it was decided to use generative models: CTGAN (based on GAN architecture) and TVAE (based on autoencoders). Lastly, the synthetic data were compared with the real ones in terms of statistical properties (e.g. shape of distributions, correlations etc.) and their applicability in data analysis (the PCA method). Results: The Overall Quality Score was higher for the data generated by TVAE, but after examining the results in more detail, it was seen that the data generated by CTGAN had a better quality in terms of keeping the statistical properties of the real data. Comparing the results of the PCA method, TVAE was better than CTGAN. In addition, the TVAE method was less time-consuming than CTGAN. Implications and recommendations: Before publishing the synthetic data externally, it is recommended that the data are generated using several algorithms, evaluating their final results and finally selecting the best option. This action enables the resulting dataset to be of the highest quality. In further research, it is proposed that other algorithms are tested (e.g. CopulaGAN or TableGAN), in an attempt to deal with some of the realistic data problems that were missed in this analysis, such as missing values (the work was carried out with a complete dataset). Data generated in this study may be used to build financial indicators, which in turn could be used to construct company assessment models. Originality/value: Synthetic data help to deal with some of the data limitations, such as data privacy or scarcity. Due to their statistical similarity with real data, it is possible to use them in advanced machine learning methods instead of real datasets. Analysis on high quality synthetic data allows conclusions similar to analysis on real data to be achieved, while retaining privacy and without publishing sensitive data to third parties.
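The statistical comparison described above (distribution shapes and correlations of synthetic versus real tables) can be illustrated with a minimal fidelity check such as the one below; the column names and placeholder data are stand-ins, not the Orbis variables used in the study.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def fidelity_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-column Kolmogorov-Smirnov statistic plus the overall correlation-matrix gap."""
    rows = []
    for col in real.columns:
        stat, pval = ks_2samp(real[col], synthetic[col])
        rows.append({"column": col, "ks_stat": stat, "p_value": pval})
    corr_gap = np.abs(real.corr() - synthetic.corr()).to_numpy().mean()
    print(f"mean |corr_real - corr_synth| = {corr_gap:.3f}")
    return pd.DataFrame(rows)

# Illustrative usage with random stand-in data instead of the real financial statements.
rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["assets", "revenue", "equity"])
synthetic = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["assets", "revenue", "equity"])
print(fidelity_report(real, synthetic))
```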

  • Research Article
  • Citations: 1
  • 10.1080/17415970903234398
Determining a stable relationship between hedge fund index HFRI-Equity and S&P 500 behaviour, using filtering and maximum likelihood
  • Oct 8, 2009
  • Inverse Problems in Science and Engineering
  • Paolo Capelli + 4 more

In this article we test the ability of the stochastic differential model proposed by Fatone et al. [Maximum likelihood estimation of the parameters of a system of stochastic differential equations that models the returns of the index of some classes of hedge funds, J. Inv. Ill-Posed Probl. 15 (2007), pp. 329–362] to forecast the returns of a long-short equity hedge fund index and of a market index, that is, of the Hedge Fund Research performance Index (HFRI)-Equity index and of the S&P 500 (Standard & Poor 500 New York Stock Exchange) index, respectively. The model is based on the assumptions that the value of the variation of the log-return of the hedge fund index (HFRI-Equity) is proportional, up to an additive stochastic error, to the value of the variation of the log-return of a market index (S&P 500), and that the log-return of the market index can be satisfactorily modelled using the Heston stochastic volatility model. The model consists of a system of three stochastic differential equations: two of them are the Heston stochastic volatility model and the third one is the equation that models the behaviour of the hedge fund index and its relation with the market index. The model is calibrated on observed data using a method based on filtering and maximum likelihood proposed by Mariani et al. [Maximum likelihood estimation of the Heston stochastic volatility model using asset and option prices: An application of nonlinear filtering theory, Opt. Lett. 2 (2008), pp. 177–222] and further developed in Fatone et al. [Maximum likelihood estimation of the parameters of a system of stochastic differential equations that models the returns of the index of some classes of hedge funds, J. Inv. Ill-Posed Probl. 15 (2007), pp. 329–362; The calibration of the Heston stochastic volatility model using filtering and maximum likelihood methods, in Proceedings of Dynamic Systems and Applications, Vol. 5, G.S. Ladde, N.G. Medhin, C. Peng, and M. Sambandham, eds., Dynamic Publishers, Atlanta, USA, 2008, pp. 170–181]. That is, an inverse problem for the stochastic dynamical system representing the model is solved using the calibration procedure. The data analysed are monthly data from January 1990 to June 2007. For each observation time, they consist of the value at the observation time of the log-returns of the HFRI-Equity and S&P 500 indices. The calibration procedure uses appropriate subsets of data, that is, the data observed in a six-month time period. This six-month window is rolled through the time series, generating a sequence of calibration problems. The values of the HFRI-Equity and S&P 500 indices forecasted using the calibrated models are compared to the observed values of the indices. The result of the comparison is very satisfactory. The website http://www.econ.univpm.it/recchioni/finance/w8 contains some auxiliary material, including some animations that help the understanding of this article. A more general reference to the work of some of the authors and of their coauthors in mathematical finance is the website: http://www.econ.univpm.it/recchioni/finance.
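For orientation, the three-equation system described above has roughly the following form, with notation assumed here for illustration (see Fatone et al., 2007, for the exact specification):

```latex
% Assumed notation: S_t market index (S&P 500), v_t its instantaneous variance,
% H_t hedge fund index (HFRI-Equity).
\begin{aligned}
  dS_t     &= \mu\, S_t\, dt + \sqrt{v_t}\, S_t\, dW^1_t             && \text{(Heston price equation)}\\
  dv_t     &= \kappa\,(\theta - v_t)\, dt + \xi \sqrt{v_t}\, dW^2_t  && \text{(Heston variance equation)}\\
  d\ln H_t &= \beta\, d\ln S_t + \sigma\, dW^3_t                     && \text{(hedge fund vs. market index)}
\end{aligned}
```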

  • Research Article
  • Citations: 2
  • 10.1016/j.neuroimage.2024.120936
Generative modeling of the Circle of Willis using 3D-StyleGAN
  • Nov 23, 2024
  • NeuroImage
  • Orhun Utku Aydin + 6 more

  • Research Article
  • 10.1016/j.jbi.2025.104948
TransDiffECG: Semantically controllable ECG synthesis via transformer-based diffusion modeling.
  • Dec 1, 2025
  • Journal of biomedical informatics
  • Yuxin Lin + 7 more

  • Conference Article
  • Citations: 21
  • 10.1109/cvpr52688.2022.00898
Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data
  • Jun 1, 2022
  • Samarth Mishra + 7 more

Pre-training models on Imagenet or other massive datasets of real images has led to major advances in computer vision, albeit accompanied by shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very different domains. In using such synthetic data for pre-training, we find that downstream performance on different tasks is favored by different configurations of simulation parameters (e.g. lighting, object pose, backgrounds, etc.), and that there is no one-size-fits-all solution. It is thus better to tailor synthetic pre-training data to a specific downstream task for best performance. We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters to generate synthetic pre-training data for them. Task2Sim learns this mapping by training to find the set of best parameters on a set of "seen" tasks. Once trained, it can then be used to predict the best simulation parameters for novel "unseen" tasks in one shot, without requiring additional training. Given a budget in number of images per class, our extensive experiments with 20 diverse downstream tasks show that Task2Sim's task-adaptive pre-training data results in significantly better downstream performance than non-adaptively choosing simulation parameters, on both seen and unseen tasks. It is even competitive with pre-training on real images from Imagenet.

  • Research Article
  • Citations: 13
  • 10.1016/j.compbiomed.2024.108389
Overcoming data scarcity in radiomics/radiogenomics using synthetic radiomic features
  • Mar 27, 2024
  • Computers in Biology and Medicine
  • Milad Ahmadian + 13 more

  • Research Article
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
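The utility evaluation described above, training a model on the amplified synthetic data and scoring it on the real-world data (train on synthetic, test on real), can be sketched roughly as follows; the logistic-regression pipeline and the target column name are illustrative assumptions rather than the study's actual setup.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_synthetic_test_real(synthetic: pd.DataFrame, real: pd.DataFrame,
                              target: str = "kps_deterioration") -> float:
    """Fit a classifier on synthetic rows and report its F1 score on the real-world dataset."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(synthetic.drop(columns=[target]), synthetic[target])
    preds = model.predict(real.drop(columns=[target]))
    return f1_score(real[target], preds)

# e.g. f1 = train_synthetic_test_real(synthetic_df, real_world_df)
```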

  • Research Article
  • Citations: 2
  • 10.1155/2022/8578817
Enterprise Financial Management Control System considering Virtual Realization Technology Combined with Comprehensive Budget Management
  • Aug 8, 2022
  • Mobile Information Systems
  • Dan Zhang

Nowadays, corporate financial control management systems have become more sophisticated with the development of the economy. With the continuous enrichment of financial management systems, enterprises will have to face more and more problems. An effective modern way to solve these financial problems is the combination of financial budget management and financial control. It is very important to combine financial control and financial management from the perspective of collaboration. Taking an overall view, relevant personnel should consider both within the financial management system. They should make effective arrangements at both the system and implementation levels, so as to effectively prevent enterprise development crises caused by control risks. Relevant personnel must strictly control finances at the source and standardize the various financial budget management systems. Virtual reality is an emerging digital technology of recent years that is based on computer networks and transforms traditional paper-based information into a form that is interactive, understandable, and easy to use. It differs from traditional media: in a traditional environment, people must obtain the content they need visually. Virtual reality can provide human-computer interconnection and various forms of interactive experiences; it also enables users to update the interface in real time and evaluate feedback on related products or services. Therefore, virtual reality is widely used in enterprises.

  • Research Article
  • 10.1007/s10845-026-02795-6
Synthetic Data for Predictive Maintenance: A Systematic Review and Framework for Industry 4.0 Applications
  • Jan 28, 2026
  • Journal of Intelligent Manufacturing
  • Walter Nieminen + 6 more

In industrial Predictive Maintenance (PdM), effective data-driven models are often limited by a scarcity of data, dataset imbalance, and the high costs of collecting failure data. By simulating realistic failure scenarios and enhancing model training, synthetic data generation has emerged as a promising strategy to overcome these challenges. This article is a systematic literature review of 86 peer-reviewed articles published since 2020 that focus on synthetic data applications in medium-to-heavy machinery and industrial processes. Data generation techniques fall into four key categories: data augmentation, generative models, physics-based simulations and hybrid approaches, and feature-based transformations. This review analyzes the strengths, limitations, and adoption trends of each method. Findings reveal that hybrid and physics-informed models are particularly valuable in safety-critical domains where model transparency and adherence to physical laws are essential and where industrial contexts demand higher reliability and contextual accuracy. To address these needs, the Synthetic Data-Enhanced PdM (SD-PdM) framework, a five-phase methodology for integrating synthetic data into maintenance strategies, is proposed. This framework supports scalable, explainable, and economically viable smart maintenance solutions.
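Of the four categories of generation techniques listed in the review, plain data augmentation is the simplest to illustrate. The sketch below jitters and rescales a sensor-like signal to create additional training copies; the signal and parameter values are stand-ins, not drawn from any of the reviewed studies.

```python
import numpy as np

def augment_signal(signal: np.ndarray, n_copies: int = 10,
                   jitter_std: float = 0.02, scale_range=(0.9, 1.1), seed: int = 0):
    """Generate augmented copies of a 1-D sensor signal via additive noise and amplitude scaling."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        scale = rng.uniform(*scale_range)
        noise = rng.normal(0.0, jitter_std, size=signal.shape)
        copies.append(scale * signal + noise)
    return np.stack(copies)

vibration = np.sin(np.linspace(0, 20 * np.pi, 2048))  # stand-in for a real sensor trace
augmented = augment_signal(vibration)                  # shape (10, 2048)
```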

  • Research Article
  • Citations: 1
  • 10.2196/53241
Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation.
  • Apr 22, 2024
  • JMIR Formative Research
  • Elnaz Karimian Sichani + 3 more

Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
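The core generative step described above, sampling new rows of the latent patient factor matrix and reconstructing synthetic records from the decomposition, can be sketched as follows. The factor matrices here are random stand-ins, and a multivariate normal replaces the sequential trees, copula, and Hamiltonian Monte Carlo samplers used in the paper; in practice the factors would come from a GCP decomposition of the real tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in CP/GCP factors for a (patients x time x features) tensor of rank 4.
rank, n_patients, n_time, n_feat = 4, 140, 24, 10
patient_factors = rng.normal(size=(n_patients, rank))  # would come from a GCP decomposition
time_factors = rng.normal(size=(n_time, rank))
feature_factors = rng.normal(size=(n_feat, rank))

# Fit a simple generative model of the patient factors (multivariate normal as a stand-in).
mean = patient_factors.mean(axis=0)
cov = np.cov(patient_factors, rowvar=False)

# Sample new latent patients and reconstruct a synthetic longitudinal tensor.
n_synth = 1000
new_patient_factors = rng.multivariate_normal(mean, cov, size=n_synth)
synthetic_tensor = np.einsum("ir,jr,kr->ijk",
                             new_patient_factors, time_factors, feature_factors)
print(synthetic_tensor.shape)  # (1000, 24, 10)
```

Note how this reduces the synthesis problem, as the abstract says, to generating a small non-longitudinal factor matrix rather than the full longitudinal tensor.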

  • Research Article
  • Citations: 44
  • 10.1016/j.elerap.2015.07.001
Intelligent techniques for secure financial management in cloud computing
  • Jul 10, 2015
  • Electronic Commerce Research and Applications
  • Lidia Ogiela

  • Research Article
  • 10.32718/nvlvet-e9401
Financial management as a component of an effective management system for an agricultural enterprise in today's challenging environment
  • Jun 26, 2020
  • Scientific Messenger of LNU of Veterinary Medicine and Biotechnologies
  • R M Myniv

In modern financial management, the highest goal of the activity is the growth of the value of the enterprise and the income of its owners. From this point of view, the financial manager should be seen as an intermediary between the enterprise and the investors, and the entity acts as the "client" of the investors. The concept of financial management combines the two categories of "finance" and "management" and directly relates to the business entity. In times of financial crisis, rational management of capital is required, which enables the use of new management tools that take into account the possibilities offered by innovation. In financial management, management aimed at the financial recovery of an agricultural enterprise is a system of principles and methods for developing and implementing a set of special management decisions that prevent and overcome the financial crisis, as well as minimize its negative consequences. Financial management of an agricultural enterprise is based on three basic concepts: the concept of present value, the concept of entrepreneurial and the concept of cash flows. Any business can be viewed as an interconnected system of movements of financial resources caused by management decisions. The content of financial management is the effective use of the financial mechanism: a system of financial management designed to organize the interaction of financial relations and monetary funds in order to optimize their impact on the final results of the enterprise and thereby achieve its strategic and tactical goals. Among the main tasks of financial management are: identification of financial sources of production development; definition of effective directions for investing financial resources; rationalization of operations with securities; and establishing optimal relations with the financial and credit system and with economic entities. Financial management as a part of the system of effective management of an agricultural enterprise envisages observance of the following principles: adaptability, i.e., the ability of the financial management system to react actively to changes in the internal and external environment (the principle of dynamism) and to adapt its own activity in accordance with these changes; manageability, that is, subordination to decisions made at the highest level of management; consistency, i.e., determination of all financial management processes at all levels; and optimality, which implies a construction of information flows and organizational support of financial management that ensures an optimal decision-making process. The defining provisions of the concept of financial management include: achievement of maximum social, personal and collective effect; application of a synthesis of approaches to the construction of the financial management system; allocation of financial management subsystems based on financial management methods; separation of the functions of financial management from the point of view of financial resources management; provision of a mechanism for close interaction between the subsystems and functions of financial management; and management based on the regulation of a system of balanced indicators characterizing the operation of the subsystems and the performance of the functions of financial management of an agricultural enterprise.

  • Conference Article
  • Citations: 10
  • 10.1109/icarsc49921.2020.9096166
Overcoming Data Scarcity in Speaker Identification: Dataset Augmentation with Synthetic MFCCs via Character-level RNN
  • Apr 1, 2020
  • Jordan J Bird + 4 more

Autonomous speaker identification suffers from data scarcity, since it is unrealistic to gather hours of audio from a single speaker to form a dataset; this inevitably leads to class imbalance compared with the abundant non-speaker data available in large-scale public speech datasets. In this study, we explore the possibility of improving speaker recognition by augmenting the dataset with synthetic data produced by training a Character-level Recurrent Neural Network on a short clip of five spoken sentences. A deep neural network is trained on a selection of the Flickr8k dataset as well as the real and synthetic speaker data (all in the form of MFCCs) as a binary classification problem in order to discern the speaker from the Flickr speakers. With synthetic sets ranging from 2,500 to 10,000 data objects, the network weights are then transferred to the original dataset of only Flickr8k and the real speaker data, in order to discern whether useful rules can be learnt from the synthetic data. Results for all three subjects show that fine-tuned learning from datasets augmented with synthetic speech improves the classification accuracy, F1 score, precision, and recall when applied to the scarce real data vs non-speaker data. We conclude that even with just five short spoken sentences, data augmentation via synthetic speech data generated by a Char-RNN can improve the speaker classification process. Accuracy and related metrics are shown to improve from around 93% to 99% for three subjects classified from thousands of others when fine-tuning from exposure to 2,500-10,000 synthetic data points. High F1 scores, precision, and recall also show that issues due to class imbalance are solved.
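The augment-then-transfer procedure described above (pretrain a classifier on real plus synthetic MFCC vectors, then transfer the weights and fine-tune on the scarce real data alone) can be sketched roughly as follows in PyTorch; the network size, MFCC dimensionality, and placeholder tensors are assumptions, not the paper's exact configuration.

```python
import torch
from torch import nn

def make_classifier(n_mfcc: int = 26) -> nn.Module:
    """Small binary classifier over flattened MFCC features (speaker vs. non-speaker)."""
    return nn.Sequential(nn.Linear(n_mfcc, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))

def train(model, X, y, epochs=20, lr=1e-3):
    """Full-batch training with binary cross-entropy on logits."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        opt.step()
    return model

# Stage 1: pretrain on real + synthetic MFCCs (placeholder tensors stand in for the datasets).
X_aug, y_aug = torch.randn(4000, 26), torch.randint(0, 2, (4000,)).float()
model = train(make_classifier(), X_aug, y_aug)

# Stage 2: transfer the weights and fine-tune on the scarce real data only.
X_real, y_real = torch.randn(300, 26), torch.randint(0, 2, (300,)).float()
model = train(model, X_real, y_real, epochs=10, lr=1e-4)
```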
