Stacked multichannel autoencoder – an efficient way of learning from synthetic data
Learning from synthetic data has many important applications in case where sufficient amounts of labeled data are not available. Using synthetic data is challenging due to differences in feature distributions between synthetic and actual data, a phenomenon we term synthetic gap. In this paper, we investigate and formalize a general framework – Stacked Multichannel Autoencoder (SMCAE) that enables bridging the synthetic gap and learning from synthetic data more efficiently. In particular, we show that our SMCAE can not only transform and use synthetic data on a challenging face-sketch recognition task, but that it can also help simulate real images which can be used for training classifiers for recognition. Preliminary experiments validate the effectiveness of the proposed framework.
- Conference Article
23
- 10.1109/icmla.2015.199
- Dec 1, 2015
Learning from synthetic data has many important and practical applications, An example of application is photo-sketch recognition. Using synthetic data is challenging due to the differences in feature distributions between synthetic and real data, a phenomenon we term synthetic gap. In this paper, we investigate and formalize a general framework -- Stacked Multichannel Autoencoder (SMCAE) that enables bridging the synthetic gap and learning from synthetic data more efficiently. In particular, we show that our SMCAE can not only transform and use synthetic data on the challenging face-sketch recognition task, but that it can also help simulate real images, which can be used for training classifiers for recognition. Preliminary experiments validate the effectiveness of the framework.
- Abstract
- 10.1182/blood-2024-209541
- Nov 5, 2024
- Blood
Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology
- Abstract
1
- 10.1182/blood-2022-171057
- Nov 15, 2022
- Blood
Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia
- Research Article
11
- 10.1051/0004-6361/202449865
- Jul 1, 2024
- Astronomy & Astrophysics
Context. Deep learning (DL) techniques are a promising approach among the set of methods used in the ever-challenging determination of stellar parameters in M dwarfs. In this context, transfer learning could play an important role in mitigating uncertainties in the results due to the synthetic gap (i.e. difference in feature distributions between observed and synthetic data). Aims. We propose a feature-based deep transfer learning (DTL) approach based on autoencoders to determine stellar parameters from high-resolution spectra. Using this methodology, we provide new estimations for the effective temperature, surface gravity, metallicity, and projected rotational velocity for 286 M dwarfs observed by the CARMENES survey. Methods. Using autoencoder architectures, we projected synthetic PHOENIX-ACES spectra and observed CARMENES spectra onto a new feature space of lower dimensionality in which the differences between the two domains are reduced. We used this low-dimensional new feature space as input for a convolutional neural network to obtain the stellar parameter determinations. Results. We performed an extensive analysis of our estimated stellar parameters, ranging from 3050 to 4300 K, 4.7 to 5.1 dex, and −0.53 to 0.25 dex for Teff, log 𝑔, and [Fe/H], respectively. Our results are broadly consistent with those of recent studies using CARMENES data, with a systematic deviation in our Teff scale towards hotter values for estimations above 3750 K. Furthermore, our methodology mitigates the deviations in metallicity found in previous DL techniques due to the synthetic gap. Conclusions. We consolidated a DTL-based methodology to determine stellar parameters in M dwarfs from synthetic spectra, with no need for high-quality measurements involved in the knowledge transfer. These results suggest the great potential of DTL to mitigate the differences in feature distributions between the observations and the PHOENIX-ACES spectra.
- Research Article
7
- 10.1190/tle41060392.1
- Jun 1, 2022
- The Leading Edge
This paper discusses the generation of synthetic 3D seismic data for training neural networks to solve a variety of seismic processing, interpretation, and inversion tasks. Using synthetic data is a way to address the shortage of seismic data, which are required for solving problems with machine learning techniques. Synthetic data are built via a simulation process that is based on a mathematical representation of the physics of the problem. In other words, using synthetic data is an indirect way to teach neural networks about the physics of the problem. An important incentive for using synthetic data to solve problems with artificial intelligence methods is that with real seismic data the ground truth is always unknown. When generating synthetic seismic data, we first build the model and then calculate the data, so the answer (model) is always known and always exact. We describe a methodology for generating on-the-fly simulated postmigration (1D modeling) synthetic data in 3D, which are high resolution and look similar to real data. A wide range of models is covered by generating an unlimited number of data examples. The synthetic data are built from impedance models that are constructed through geostatistical simulation of real well logs. With geostatistical simulation, we can describe various geologic variance models in 3D and obtain realistic images. To cover a broad range of scenarios, we need to generalize the seismic data story by randomly perturbing many parameters including structures, conformity styles, dip-strike directions, variograms, measured input logs, frequencies, phase spectra, etc.
- Research Article
3
- 10.1007/s10742-021-00241-z
- Feb 23, 2021
- Health services & outcomes research methodology
While record linkage can expand analyses performable from survey microdata, it also incurs greater risk of privacy-encroaching disclosure. One way to mitigate this risk is to replace some of the information added through linkage with synthetic data elements. This paper describes a case study using the National Hospital Care Survey (NHCS), which collects patient records under a pledge of protecting patient privacy from a sample of U.S. hospitals for statistical analysis purposes. The NHCS data were linked to the National Death Index (NDI) to enhance the survey with mortality information. The added information from NDI linkage enables survival analyses related to hospitalization, but as the death information includes dates of death and detailed causes of death, having it joined with the patient records increases the risk of patient re-identification (albeit only for deceased persons). For this reason, an approach was tested to develop synthetic data that uses models from survival analysis to replace vital status and actual dates-of-death with synthetic values and uses classification tree analysis to replace actual causes of death with synthesized causes of death. The degree to which analyses performed on the synthetic data replicate results from analysis on the actual data is measured by comparing survival analysis parameter estimates from both data files. Because synthetic data only have value to the degree that they can be used to produce statistical estimates that are like those based on the actual data, this evaluation is an essential first step in assessing the potential utility of synthetic mortality data.
- Book Chapter
15
- 10.1007/978-3-031-27077-2_34
- Jan 1, 2023
High-quality tabular data is a crucial requirement for developing data-driven applications, especially healthcare-related ones, because most of the data nowadays collected in this context is in tabular form. However, strict data protection laws complicates the access to medical datasets. Thus, synthetic data has become an ideal alternative for data scientists and healthcare professionals to circumvent such hurdles. Although many healthcare institutions still use the classical de-identification and anonymization techniques for generating synthetic data, deep learning-based generative models such as generative adversarial networks (GANs) have shown a remarkable performance in generating tabular datasets with complex structures. This paper examines the GANs’ potential and applicability within the healthcare industry, which often faces serious challenges with insufficient training data and patient records sensitivity. We investigate several state-of-the-art GAN-based models proposed for tabular synthetic data generation. Healthcare datasets with different sizes, numbers of variables, column data types, feature distributions, and inter-variable correlations are examined. Moreover, a comprehensive evaluation framework is defined to evaluate the quality of the synthetic records and the viability of each model in preserving the patients’ privacy. The results indicate that the proposed models can generate synthetic datasets that maintain the statistical characteristics, model compatibility and privacy of the original data. Moreover, synthetic tabular healthcare datasets can be a viable option in many data-driven applications. However, there is still room for further improvements in designing a perfect architecture for generating synthetic tabular data.
- Research Article
1
- 10.2113/2025/lithosphere_2024_240
- Jul 7, 2025
- Lithosphere
This study presents a parameter optimization strategy for generating synthetic seismic data that closely match the characteristics of target field data, aiming to improve deep learning-based fault detection. An analysis in the latent space is conducted to assess the similarities between synthetic data and target field data. Based on the results from this analysis, we optimize the parameters for generating synthetic data. Further refinement of the data generation process is achieved through the application of Explainable Artificial Intelligence (XAI). The fault interpretation results using the U-Net model trained on optimized synthetic data show significant improvements compared to those from the model trained on unoptimized data. The optimization strategy employed allows for the visualization of feature distributions in the latent space, offering a direct understanding of how the distribution of features shifts depending on the desired parameters. This approach not only circumvents the limitations associated with using field data for training, such as the challenge of acquiring accurate fault structure labels and the scarcity of sufficient training data, but also overcomes the potential discrepancies in interpretation results due to significant deviations in the characteristics of synthetic data from the target field data. The proposed optimization framework improves the performance of deep learning models in fault interpretation and establishes an advanced approach for using synthetic data in deep learning–based seismic interpretation.
- Research Article
9
- 10.32604/cmc.2023.032843
- Jan 1, 2023
- Computers, Materials & Continua
The increasing penetration rate of electric kickboard vehicles has been popularized and promoted primarily because of its clean and efficient features. Electric kickboards are gradually growing in popularity in tourist and education-centric localities. In the upcoming arrival of electric kickboard vehicles, deploying a customer rental service is essential. Due to its free-floating nature, the shared electric kickboard is a common and practical means of transportation. Relocation plans for shared electric kickboards are required to increase the quality of service, and forecasting demand for their use in a specific region is crucial. Predicting demand accurately with small data is troublesome. Extensive data is necessary for training machine learning algorithms for effective prediction. Data generation is a method for expanding the amount of data that will be further accessible for training. In this work, we proposed a model that takes time-series customers’ electric kickboard demand data as input, pre-processes it, and generates synthetic data according to the original data distribution using generative adversarial networks (GAN). The electric kickboard mobility demand prediction error was reduced when we combined synthetic data with the original data. We proposed Tabular-GAN-Modified-WGAN-GP for generating synthetic data for better prediction results. We modified The Wasserstein GAN-gradient penalty (GP) with the RMSprop optimizer and then employed Spectral Normalization (SN) to improve training stability and faster convergence. Finally, we applied a regression-based blending ensemble technique that can help us to improve performance of demand prediction. We used various evaluation criteria and visual representations to compare our proposed model’s performance. Synthetic data generated by our suggested GAN model is also evaluated. The TGAN-Modified-WGAN-GP model mitigates the overfitting and mode collapse problem, and it also converges faster than previous GAN models for synthetic data creation. The presented model’s performance is compared to existing ensemble and baseline models. The experimental findings imply that combining synthetic and actual data can significantly reduce prediction error rates in the mean absolute percentage error (MAPE) of 4.476 and increase prediction accuracy.
- Research Article
- 10.1016/j.compbiomed.2024.109460
- Nov 29, 2024
- Computers in Biology and Medicine
KeyGAN: Synthetic keystroke data generation in the context of digital phenotyping
- Research Article
72
- 10.1177/20539517221145372
- Jan 1, 2023
- Big Data & Society
Machine-learning algorithms have become deeply embedded in contemporary society. As such, ample attention has been paid to the contents, biases, and underlying assumptions of the training datasets that many algorithmic models are trained on. Yet, what happens when algorithms are trained on data that are not real, but instead data that are ‘synthetic’, not referring to real persons, objects, or events? Increasingly, synthetic data are being incorporated into the training of machine-learning algorithms for use in various societal domains. There is currently little understanding, however, of the role played by and the ethicopolitical implications of synthetic training data for machine-learning algorithms. In this article, I explore the politics of synthetic data through two central aspects: first, synthetic data promise to emerge as a rich source of exposure to variability for the algorithm. Second, the paper explores how synthetic data promise to place algorithms beyond the realm of risk. I propose that an analysis of these two areas will help us better understand the ways in which machine-learning algorithms are envisioned in the light of synthetic data, but also how synthetic training data actively reconfigure the conditions of possibility for machine learning in contemporary society.
- Conference Article
106
- 10.1109/ijcnn.2017.7966298
- May 1, 2017
We draw a formal connection between using synthetic training data to optimize neural network parameters and approximate, Bayesian, model-based reasoning. In particular, training a neural network using synthetic data can be viewed as learning a proposal distribution generator for approximate inference in the synthetic-data generative model. We demonstrate this connection in a recognition task where we develop a novel Captcha-breaking architecture and train it using synthetic data, demonstrating both state-of-the-art performance and a way of computing task-specific posterior uncertainty. Using a neural network trained this way, we also demonstrate successful breaking of real-world Captchas currently used by Facebook and Wikipedia. Reasoning from these empirical results and drawing connections with Bayesian modeling, we discuss the robustness of synthetic data results and suggest important considerations for ensuring good neural network generalization when training with synthetic data.
- Research Article
222
- 10.1038/s41598-019-49656-2
- Sep 20, 2019
- Scientific Reports
Most approaches to machine learning from electronic health data can only predict a single endpoint. The ability to simultaneously simulate dozens of patient characteristics is a crucial step towards personalized medicine for Alzheimer’s Disease. Here, we use an unsupervised machine learning model called a Conditional Restricted Boltzmann Machine (CRBM) to simulate detailed patient trajectories. We use data comprising 18-month trajectories of 44 clinical variables from 1909 patients with Mild Cognitive Impairment or Alzheimer’s Disease to train a model for personalized forecasting of disease progression. We simulate synthetic patient data including the evolution of each sub-component of cognitive exams, laboratory tests, and their associations with baseline clinical characteristics. Synthetic patient data generated by the CRBM accurately reflect the means, standard deviations, and correlations of each variable over time to the extent that synthetic data cannot be distinguished from actual data by a logistic regression. Moreover, our unsupervised model predicts changes in total ADAS-Cog scores with the same accuracy as specifically trained supervised models, additionally capturing the correlation structure in the components of ADAS-Cog, and identifies sub-components associated with word recall as predictive of progression.
- Conference Article
29
- 10.1109/cvpr52688.2022.00898
- Jun 1, 2022
Pre-training models on Imagenet or other massive datasets of real images has led to major advances in Computer vision, albeit accompanied with shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very different domains. In using such synthetic data for pre-training, we find that downstream performance on different tasks are fa-vored by different configurations of simulation parameters (e.g. lighting, object pose, backgrounds, etc.), and that there is no one-size-fits-all solution. It is thus better to tailor syn-thetic pre-training data to a specific downstream task, for best performance. We introduce Task2Sim, a unified model mapping downstream task representations to optimal sim-ulation parameters to generate synthetic pre-training data for them. Task2Sim learns this mapping by training to find the set of best parameters on a set of "seen" tasks. Once trained, it can then be used to predict best simulation pa-rameters for novel "unseen" tasks in one shot, without re-quiring additional training. Given a budget in number of images per class, our extensive experiments with 20 di-verse downstream tasks show Task2Sim's task-adaptive pre-training data results in significantly better downstream per-formance than non-adaptively choosing simulation param-eters on both seen and unseen tasks. It is even competitive with pre-training on real images from Imagenet.
- Research Article
22
- 10.2196/30697
- Oct 4, 2021
- Journal of medical Internet research
BackgroundComputationally derived (“synthetic”) data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic.ObjectiveWe aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes.MethodsWe used the National COVID Cohort Collaborative’s instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19–positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19–related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data.ResultsFor each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts.ConclusionsThis paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.