KeyGAN: Synthetic keystroke data generation in the context of digital phenotyping

Similar Papers
  • Abstract
  • 10.1182/blood-2024-209541
Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology
  • Nov 5, 2024
  • Blood
  • Saverio D'Amico + 41 more

  • Research Article
  • Cite Count: 48
  • 10.1016/j.isprsjprs.2023.05.015
Utilization of synthetic minority oversampling technique for improving potato yield prediction using remote sensing data and machine learning algorithms with small sample size of yield data
  • May 24, 2023
  • ISPRS Journal of Photogrammetry and Remote Sensing
  • Hamid Ebrahimy + 2 more

  • Research Article
  • Cite Count: 21
  • 10.1109/jbhi.2023.3236722
Characterization of Synthetic Health Data Using Rule-Based Artificial Intelligence Models.
  • Aug 1, 2023
  • IEEE Journal of Biomedical and Health Informatics
  • Marta Lenatti + 4 more

The aim of this study is to apply and characterize eXplainable AI (XAI) to assess the quality of synthetic health data generated using a data augmentation algorithm. In this exploratory study, several synthetic datasets are generated using various configurations of a conditional Generative Adversarial Network (GAN) from a set of 156 observations related to adult hearing screening. A rule-based native XAI algorithm, the Logic Learning Machine, is used in combination with conventional utility metrics. The classification performance in different conditions is assessed: models trained and tested on synthetic data, models trained on synthetic data and tested on real data, and models trained on real data and tested on synthetic data. The rules extracted from real and synthetic data are then compared using a rule similarity metric. The results indicate that XAI may be used to assess the quality of synthetic data by (i) the analysis of classification performance and (ii) the analysis of the rules extracted on real and synthetic data (number, covering, structure, cut-off values, and similarity). These results suggest that XAI can be used in an original way to assess synthetic health data and extract knowledge about the mechanisms underlying the generated data.
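
The three evaluation conditions named in this abstract (train and test on synthetic data, train on synthetic and test on real, train on real and test on synthetic) can be sketched as follows. This is a minimal illustration only, not the study's method: the nearest-centroid classifier, the toy rows, and the names `fit_centroids`, `cross_evaluate`, `TSTS`, `TSTR`, and `TRTS` are all assumptions introduced here, and the synthetic-on-synthetic case reuses the training rows for brevity where a real evaluation would hold out a test split.

```python
from statistics import mean

def fit_centroids(rows):
    """Fit a nearest-centroid model. rows: list of (feature_vector, label)."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    # One centroid per label: the per-feature mean of that label's rows.
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_label.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))

def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

def cross_evaluate(real, synthetic):
    """Score the three train/test pairings of real and synthetic data."""
    model_real = fit_centroids(real)
    model_syn = fit_centroids(synthetic)
    return {
        "TSTS": accuracy(model_syn, synthetic),  # train synthetic, test synthetic
        "TSTR": accuracy(model_syn, real),       # train synthetic, test real
        "TRTS": accuracy(model_real, synthetic), # train real, test synthetic
    }

# Hypothetical toy data: two features, two classes.
real = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.8, 1.1], 1)]
synthetic = [([0.1, 0.2], 0), ([0.0, 0.0], 0), ([0.9, 1.0], 1), ([1.1, 0.8], 1)]
scores = cross_evaluate(real, synthetic)
```

A large gap between the synthetic-only score and the cross-domain scores would suggest that the synthetic data do not reflect the real distribution well.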

  • Dissertation
  • 10.14264/a76d4cd
Automated microseismic event detection with machine learning
  • Oct 8, 2021
  • The University of Queensland
  • Zhengguang Zhao

Microseismic monitoring is essential to image and map hydraulic fractures during and after the hydraulic fracturing stimulations for unconventional oil and gas reservoirs. Insights into the underlying reservoir geology and structure can be obtained and effectiveness of hydraulic fracturing engineering parameters can be evaluated through located microseismic events and interpreted hydraulic fractures. There are many important steps in a typical microseismic data processing workflow, including preprocessing, microseismic event detection, microseismic event location, etc. These steps can be implemented either automatically or manually. Automatic microseismic event detection is of particular interest in this thesis. Automatic microseismic event detection involves algorithms and/or workflows to discriminate genuine microseismic events, either P-wave events or S-wave events or both of them, from noise. From an algorithm or workflow perspective, microseismic event detection methods can be classified into three major categories: arrival-time picking, migration-based and waveform-based detection. Most state-of-the-art arrival-time picking and migration-based methods are characteristic function and threshold based. A limitation of these traditional methods is that the user-defined threshold has too much influence on detection accuracy, and an inappropriate preset threshold tends to produce low detection accuracy, especially if the signal-to-noise ratio (SNR) of a given microseismic dataset is relatively low. Recently, machine learning and deep learning based methods have been investigated to overcome the drawbacks of these traditional physical model based methods. This thesis aims to develop a workflow that leverages the support vector machine (SVM) classifier to realize automatic microseismic event detection and investigate how to train a robust SVM classifier in order to improve the microseismic event detection accuracy. 
Here, a classifier is considered to be robust if its performance has the following property: it achieves “similar” performance on a testing sample and a training sample that are “close”. In this thesis, we proposed a “Classification Is Detection” strategy, where a machine learning based approach, specifically an SVM classifier referred to as microseismic event detector (MED), was used to distinguish genuine microseismic events from noise. Thus, microseismic detection was cast as a supervised classification. Experiments in this work indicated a well-trained MED is able to achieve event detection accuracy comparable to, if not better than, that of traditional methods. To improve the detection accuracy of a MED, enhanced feature engineering was investigated. We added more 1D features, including time, frequency and multi-channel domain features, to the existing feature set published by other researchers. These features were referred to as “ZZ features” in this work. The multi-channel domain features, for example cross-correlation, proved to be effective in improving event detection accuracy. We introduced matched filter analysis (MFA) to enhance the 2D features by first applying a matched filter to the low to ultra-low SNR dataset and then extracting 2D features from the MFA data. The results indicated that a MED trained with 2D features extracted from MFA data obtained higher detection accuracy than one trained with 2D features extracted from raw data, especially when a low to ultra-low SNR dataset was presented. We also studied the impact of SNR on feature selection by carrying out many experiments with variable-SNR training datasets. These experiments indicated that 2D features were important for all training sets, regardless of their SNR; however, 1D features gained more importance weight when training an SVM using features extracted from higher-SNR training sets. 
As 2D features were more important to train a robust MED, we investigated whether adding more 2D features, for example 2D features extracted from raw data, would improve the MED performance. The result suggested that 2D features in ZZ features were sufficient to obtain a robust MED. Lastly, the impact of SNR discrepancy between training and test sets on MED performance was investigated. It was found that a MED can only perform well when it is trained and tested with datasets of similar noise level. In practice, both P-wave events and S-wave events are present in an individual seismic trace and an event detection algorithm needs to differentiate these two phases in order to feed them to the following location processes with different velocity models and wave travel times. To further differentiate P-wave and S-wave, we leveraged the existing multiclass SVM classifier to cast the two-phase microseismic event detection problem into a multiclass SVM classification problem. In multiclass classification, we introduced the multivariate time series (MTS) concept to take the 3C microseismic data as MTS data and ZZ features were expanded into 3C-ZZR features by extracting ZZ features from the X, Y and Z components of the raw training dataset. We next used both One-vs-Rest (OVR) and One-vs-One (OVO) strategies to train and test multiclass SVM classifiers. Both synthetic and field examples indicated that multiclass SVM classifiers were still able to achieve acceptable detection accuracy, while the overall event detection performance could not compete with the aforementioned binary SVM classifiers. During this course, we also found that 1D features gained more feature importance in the feature selection process and a MED trained with 1D features only was able to achieve comparable performance with a MED trained with both 1D and 2D features in 3C-ZZR features. As mentioned, both the training and test data are either synthetic or field data in all of the previous examples. 
However, it is a MED trained with synthetic data and tested on field data that is of particular interest to the industry. To find out if a MED trained by synthetic data can perform well on field data, we carried out a feasibility study of applying a MED trained by synthetic data to field data. We compared the experiment in which a MED was trained with white Gaussian noise (WGN) polluted synthetic data and tested on field data with the experiment in which a MED was trained with field noise polluted synthetic data and tested on field data. It was found that the MED trained with ZZ features extracted from WGN polluted synthetic data achieved event detection accuracy comparably high to that of the MED trained with ZZ features extracted from field noise polluted synthetic data, though neither of these MEDs could compete with the MED trained with field data in terms of event detection accuracy. Furthermore, the results of experiments in which fewer features were used in the training phase indicated that a MED trained with field noise polluted synthetic data was superior to a MED trained with WGN in terms of event detection accuracy. The machine learning based microseismic event detection methods and the robust MEDs developed and presented in this thesis can be utilised as standalone or concurrent microseismic event detection processes within a standard microseismic data processing workflow. These MEDs provide remedies to the aforementioned limitations of the conventional characteristic function and threshold based methods. Contributions made in this thesis offer an improvement on existing automatic microseismic event detection techniques and offer a new avenue for future research. 
Some of the key improvements provided by this research are that we developed a new feature set that yielded a robust MED with improved event detection accuracy, and we found that an SVM classifier trained with features extracted from field noise polluted synthetic data can achieve event detection accuracy comparable to that of a classifier trained with field data.

  • Research Article
  • Cite Count: 3
  • 10.1007/s11042-018-5879-7
Stacked multichannel autoencoder – an efficient way of learning from synthetic data
  • Apr 16, 2018
  • Multimedia Tools and Applications
  • Xi Zhang + 5 more

Learning from synthetic data has many important applications in cases where sufficient amounts of labeled data are not available. Using synthetic data is challenging due to differences in feature distributions between synthetic and actual data, a phenomenon we term synthetic gap. In this paper, we investigate and formalize a general framework – Stacked Multichannel Autoencoder (SMCAE) that enables bridging the synthetic gap and learning from synthetic data more efficiently. In particular, we show that our SMCAE can not only transform and use synthetic data on a challenging face-sketch recognition task, but that it can also help simulate real images which can be used for training classifiers for recognition. Preliminary experiments validate the effectiveness of the proposed framework.

  • Conference Article
  • Cite Count: 23
  • 10.1109/icmla.2015.199
Learning from Synthetic Data Using a Stacked Multichannel Autoencoder
  • Dec 1, 2015
  • Xi Zhang + 4 more

Learning from synthetic data has many important and practical applications; one example is photo-sketch recognition. Using synthetic data is challenging due to the differences in feature distributions between synthetic and real data, a phenomenon we term synthetic gap. In this paper, we investigate and formalize a general framework -- Stacked Multichannel Autoencoder (SMCAE) that enables bridging the synthetic gap and learning from synthetic data more efficiently. In particular, we show that our SMCAE can not only transform and use synthetic data on the challenging face-sketch recognition task, but that it can also help simulate real images, which can be used for training classifiers for recognition. Preliminary experiments validate the effectiveness of the framework.

  • Research Article
  • Cite Count: 1
  • 10.2113/2025/lithosphere_2024_240
Synthetic Training Data Optimization for Enhanced Fault Detection in Seismic Images
  • Jul 7, 2025
  • Lithosphere
  • Woochang Choi + 2 more

This study presents a parameter optimization strategy for generating synthetic seismic data that closely match the characteristics of target field data, aiming to improve deep learning-based fault detection. An analysis in the latent space is conducted to assess the similarities between synthetic data and target field data. Based on the results from this analysis, we optimize the parameters for generating synthetic data. Further refinement of the data generation process is achieved through the application of Explainable Artificial Intelligence (XAI). The fault interpretation results using the U-Net model trained on optimized synthetic data show significant improvements compared to those from the model trained on unoptimized data. The optimization strategy employed allows for the visualization of feature distributions in the latent space, offering a direct understanding of how the distribution of features shifts depending on the desired parameters. This approach not only circumvents the limitations associated with using field data for training, such as the challenge of acquiring accurate fault structure labels and the scarcity of sufficient training data, but also overcomes the potential discrepancies in interpretation results due to significant deviations in the characteristics of synthetic data from the target field data. The proposed optimization framework improves the performance of deep learning models in fault interpretation and establishes an advanced approach for using synthetic data in deep learning–based seismic interpretation.

  • Book Chapter
  • Cite Count: 15
  • 10.1007/978-3-031-27077-2_34
Generation of Synthetic Tabular Healthcare Data Using Generative Adversarial Networks
  • Jan 1, 2023
  • Alireza Hossein Zadeh Nik + 3 more

High-quality tabular data is a crucial requirement for developing data-driven applications, especially healthcare-related ones, because most of the data nowadays collected in this context is in tabular form. However, strict data protection laws complicate access to medical datasets. Thus, synthetic data has become an ideal alternative for data scientists and healthcare professionals to circumvent such hurdles. Although many healthcare institutions still use the classical de-identification and anonymization techniques for generating synthetic data, deep learning-based generative models such as generative adversarial networks (GANs) have shown remarkable performance in generating tabular datasets with complex structures. This paper examines the GANs’ potential and applicability within the healthcare industry, which often faces serious challenges with insufficient training data and the sensitivity of patient records. We investigate several state-of-the-art GAN-based models proposed for tabular synthetic data generation. Healthcare datasets with different sizes, numbers of variables, column data types, feature distributions, and inter-variable correlations are examined. Moreover, a comprehensive evaluation framework is defined to evaluate the quality of the synthetic records and the viability of each model in preserving the patients’ privacy. The results indicate that the proposed models can generate synthetic datasets that maintain the statistical characteristics, model compatibility and privacy of the original data. Moreover, synthetic tabular healthcare datasets can be a viable option in many data-driven applications. However, there is still room for further improvements in designing a perfect architecture for generating synthetic tabular data.

  • Research Article
  • Cite Count: 5
  • 10.1111/1365-2478.13307
Seismic data interpolation using deeply supervised U‐Net++ with natural seismic training sets
  • Dec 21, 2022
  • Geophysical Prospecting
  • Geng Wu + 4 more

Interpolation techniques provide an effective method for recovery of missing traces. In recent years, many researchers have applied deep learning methods to seismic data interpolation. Generally, one can choose synthetic data as a training set; however, the features of synthetic data are always inconsistent with those of field data, which may lead to inaccurate interpolation. Meanwhile, U‐Net is a common network structure used in seismic data interpolation; however, the four downsampling and upsampling structures of U‐Net have limited adaptability for different data. In this study, the deep learning method based on U‐Net++ was proposed for seismic data interpolation, which contains U‐Net with different depths. The different depths were connected by skip pathways, and the best depth of the network was chosen for different seismic data by deep supervision. Furthermore, a new strategy for training sets was designed: frequency‐wavenumber (f‐k) bandpass filters were used to convert natural images into a natural seismic training set, which has a stronger generalization capability than synthetic data as the training set. The characteristics of the new training set can effectively improve the accuracy of missing data reconstruction. Compared with the conventional U‐Net and traditional interpolation techniques, for example, the Fourier Bregman method, the proposed method produces more accurate and reasonable interpolation results. Further, it can reconstruct both irregular and regular missing seismic data, even in the presence of strong random noise and aliasing. Synthetic and field data tests showed the effectiveness, robustness and generalization of the proposed method.

  • Research Article
  • Cite Count: 49
  • 10.1109/jbhi.2022.3196697
Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets
  • Jun 1, 2023
  • IEEE Journal of Biomedical and Health Informatics
  • Antonio J Rodriguez-Almeida + 8 more

The increasing prevalence of chronic non-communicable diseases makes it a priority to develop tools for enhancing their management. On this matter, Artificial Intelligence algorithms have proven to be successful in early diagnosis, prediction and analysis in the medical field. Nonetheless, two main issues arise when dealing with medical data: lack of high-fidelity datasets and maintenance of patient's privacy. To face these problems, different techniques of synthetic data generation have emerged as a possible solution. In this work, a framework based on synthetic data generation algorithms was developed. Eight medical datasets containing tabular data were used to test this framework. Three different statistical metrics were used to analyze the preservation of synthetic data integrity and six different synthetic data generation sizes were tested. Besides, the generated synthetic datasets were used to train four different supervised Machine Learning classifiers alone, and also combined with the real data. F1-score was used to evaluate classification performance. The main goal of this work is to assess the feasibility of the use of synthetic data generation in medical data in two ways: preservation of data integrity and maintenance of classification performance.

  • Conference Article
  • Cite Count: 21
  • 10.1117/12.805619
Using synthetic data safely in classification
  • Jan 18, 2009
  • Proceedings of SPIE, the International Society for Optical Engineering
  • Jean Nonnemaker + 1 more

When is it safe to use synthetic training data in supervised classification? Trainable classifier technologies require large representative training sets consisting of samples labeled with their true class. Acquiring such training sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. Of course this immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to be an accurate representative of the real distributions?" Other questions include "When will training on synthetic data work as well as, or better than, training on real data?" We distinguish between sample space (the set of real samples), parameter space (all samples that can be generated synthetically), and finally, feature space (the set of samples in terms of finite numerical values). In this paper, we discuss a series of experiments in which we produced synthetic data in parameter space, that is, by convex interpolation among the generating parameters for samples, and showed we could amplify real data to produce a classifier that is as accurate as a classifier trained on real data. Specifically, we have explored the feasibility of varying the generating parameters for Knuth's Metafont system to see if previously unseen fonts could also be recognized. We also varied parameters for an image quality model. We have found that training on interpolated data is for the most part safe, that is to say, it never produced more errors. Furthermore, the classifier trained on interpolated data often improved class accuracy.
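
Convex interpolation among generating parameters, as described in this abstract, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the parameter names `stem_width` and `slant` are hypothetical stand-ins for font-generating parameters, and the helper names `convex_interpolate` and `random_convex_weights` are introduced here.

```python
import random

def convex_interpolate(param_sets, weights):
    """Return the convex combination of parameter dicts.

    Weights must be non-negative and sum to 1, so every interpolated value
    lies between the minimum and maximum of the corresponding inputs.
    """
    if len(param_sets) != len(weights):
        raise ValueError("need exactly one weight per parameter set")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    keys = param_sets[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, param_sets)) for k in keys}

def random_convex_weights(n, rng):
    """Draw n non-negative weights that sum to 1 by normalising uniform draws."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    if total == 0.0:  # practically unreachable; fall back to equal weights
        return [1.0 / n] * n
    return [r / total for r in raw]

# Hypothetical generating parameters for two real fonts.
font_a = {"stem_width": 10.0, "slant": 0.0}
font_b = {"stem_width": 20.0, "slant": 0.5}

midpoint = convex_interpolate([font_a, font_b], [0.5, 0.5])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic_params = [
    convex_interpolate([font_a, font_b], random_convex_weights(2, rng))
    for _ in range(5)
]
```

Each dict in `synthetic_params` would then be handed to the sample generator (Metafont in the paper's experiments) to render a new training sample.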

  • Research Article
  • Cite Count: 2
  • 10.1080/10095020.2025.2514815
The impact of fractional cover distribution in training samples on the accuracy of fractional cover estimation: a model-based evaluation
  • Jul 9, 2025
  • Geo-spatial Information Science
  • Rujia Wang + 1 more

In machine learning-based fractional cover estimation, the fractional cover distribution in training samples critically influences model construction and, consequently, the accuracy of the estimations. While some studies have descriptively compared the accuracies of machine learning-based estimations across training sets derived from different sampling methods, a significant gap remains in quantitatively analyzing how the fractional cover distribution in training samples affects accuracy. This study aims to bridge this gap by introducing descriptors for fractional cover distribution in the training set and establishing mathematical relationships between these descriptors and the accuracy of fractional cover estimation. We employed the Dirichlet distribution to characterize the joint fractional cover of multiple land classes and the Beta distribution for single-class cover. Subsequently, two descriptors were developed: the Kullback-Leibler (KL) divergence, measuring the similarity of fractional cover distributions for the target class between the training and test sets, and the geometric angle, representing the fractional cover distributions of the target class in the training set at the same KL divergence. Fractional cover estimation was performed using random forest regression, with accuracy assessed on an independent test set. The relationships between the KL divergence and accuracy, and between the geometric angle and accuracy at the same KL divergence, were modeled using univariate linear models and harmonic models, respectively. The combined effects of these descriptors on accuracy were further analyzed using coupled harmonic analysis and generalized additive models. Our experimental results, using both simulated and real data, demonstrated the effectiveness of these models. 
Given the strong explanatory power of the KL divergence in the accuracy of fractional cover estimation, we encourage researchers to report detailed statistical information of both training and test sets, enriching the understanding of model performance in fractional cover estimation.
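
The KL divergence between two Beta distributions, the first descriptor named in this abstract, can be approximated numerically as in the sketch below. This is an assumption-laden illustration rather than the authors' procedure: the abstract does not say which direction of the divergence (training relative to test, or the reverse) was used, the shape parameters are invented, and `kl_beta` is a name introduced here. It uses a midpoint rule and only the Python standard library.

```python
import math

def beta_log_pdf(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def kl_beta(a1, b1, a2, b2, steps=20000):
    """Approximate KL(Beta(a1, b1) || Beta(a2, b2)) with a midpoint rule.

    Intended for shape parameters >= 1, where both densities stay bounded
    on (0, 1) and the integrand is well behaved.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th sub-interval
        log_p = beta_log_pdf(x, a1, b1)
        log_q = beta_log_pdf(x, a2, b2)
        total += math.exp(log_p) * (log_p - log_q) * h
    return total

# Hypothetical training and test fractional-cover distributions.
train_vs_test = kl_beta(2.0, 5.0, 5.0, 2.0)
same = kl_beta(2.0, 5.0, 2.0, 5.0)
```

For this particular pair the closed-form expression in terms of digamma functions evaluates to 3.25, which the numerical result should match closely; identical distributions give zero.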

  • Conference Article
  • Cite Count: 1
  • 10.1117/12.3015657
Mind the (domain) gap: metrics for the differences in synthetic and real data distributions
  • Jun 7, 2024
  • Ashley Dale + 4 more

Synthetic data are frequently used to supplement a small set of real images and create a dataset with diverse features, but this may not improve the equivariance of a computer vision model. Our work answers the following questions: First, what metrics are useful for measuring a domain gap between real and synthetic data distributions? Second, is there an effective method for bridging an observed domain gap? We explore these questions by presenting a pathological case where the inclusion of synthetic data did not improve model performance, then presenting measurements of the difference between the real and synthetic distributions in the image space, latent space, and model prediction space. We find that augmenting the dataset with pixel-level augmentation effectively reduces the observed domain gap and improves the model F1 score to 0.95, compared to 0.43 for un-augmented data. We also observe that an increase in the average cross entropy of the latent space feature vectors is positively correlated with increased model equivariance and the closing of the domain gap. The results are explained using a framework of model regularization effects.

  • Research Article
  • Cite Count: 45
  • 10.1016/j.cmpb.2021.106371
A deep learning approach for synthetic MRI based on two routine sequences and training with synthetic data
  • Aug 31, 2021
  • Computer Methods and Programs in Biomedicine
  • Elisa Moya-Sáez + 3 more

Background and Objective: Synthetic magnetic resonance imaging (MRI) is a low-cost procedure that serves as a bridge between qualitative and quantitative MRI. However, the proposed methods require very specific sequences or private protocols which have scarcely found integration in clinical scanners. We propose a learning-based approach to compute T1, T2, and PD parametric maps from only a pair of T1- and T2-weighted images customarily acquired in the clinical routine.
Methods: Our approach is based on a convolutional neural network (CNN) trained with synthetic data; specifically, a synthetic dataset with 120 volumes was constructed from the anatomical brain model of the BrainWeb tool and served as the training set. The CNN learns an end-to-end mapping function to transform the input T1- and T2-weighted images to their underlying T1, T2, and PD parametric maps. Then, conventional weighted images unseen by the network are analytically synthesized from the parametric maps. The network can be fine-tuned with a small database of actual weighted images and maps for better performance.
Results: This approach is able to accurately compute parametric maps from synthetic brain data achieving normalized squared error values predominantly below 1%. It also yields realistic parametric maps from actual MR brain acquisitions with T1, T2, and PD values in the range of the literature and with correlation values above 0.95 compared to the T1 and T2 maps obtained from relaxometry sequences. Further, the synthesized weighted images are visually realistic; the mean square error values are always below 9% and the structural similarity index is usually above 0.90. Network fine-tuning with actual maps improves performance, while training exclusively with a small database of actual maps shows a performance degradation.
Conclusions: These results show that our approach is able to provide realistic parametric maps and weighted images out of a CNN that (a) is trained with a synthetic dataset and (b) needs only two inputs, which are in turn obtained from a common full-brain acquisition that takes less than 8 min of scan time. Although fine-tuning with actual maps improves performance, synthetic data is crucial to reach acceptable performance levels. Hence, we show the utility of our approach both for quantitative MRI in clinically viable times and for the synthesis of additional weighted images to those actually acquired.

  • Research Article
  • 10.3389/fpls.2025.1604088
Enhancing buckwheat maturity classification with generative adversarial networks for spectroscopy data augmentation
  • Jul 8, 2025
  • Frontiers in Plant Science
  • Huihui Wang + 7 more

Introduction: The optimal harvest period for buckwheat is challenging to determine due to its short growth cycle. Harvesting too early or too late can negatively affect the quality of the crop. Traditional harvest methods are labor-intensive and fail to account for the spatial variability in buckwheat quality within a field. This study explores the use of near-infrared (NIR) spectral data to classify the maturity stages of buckwheat.
Method: Four distinct developmental stages were examined: UM (Unripe Maturity), representing buckwheat harvested at 65 days after sowing; HM (Half Maturity), harvested at 75 days; MS (Full Maturity with Shell), harvested at 85 days with husks intact; and MUS (Full Maturity Unhulled Sample), also harvested at 85 days but manually dehulled. Unlike traditional machine learning models, which require diverse and extensive datasets, this study investigates the use of a conditional WGAN-GP to generate synthetic datasets and improve model performance. Four machine learning models were employed in this study: Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbors (KNN), and Partial Least Squares Linear Discriminant Analysis (PLS-LDA).
Results and Discussion: The conditional WGAN with the gradient penalty was trained for a range of epochs: 1000, 2000, 8000, 10,000, and 20,000. After training for 10,000 epochs, synthetic hyperspectral reflectance data were very similar to real spectra for each maturity category. To assess the impact of conditional WGAN-GP data augmentation, model performance was first evaluated using the original dataset as a baseline, showing PLS-LDA had the best classification performance with accuracy of 95% and kappa coefficient of 0.93. The models were then trained on a combination of original and synthetic data, revealing that synthetic data can improve the classification model performance for RF and KNN. The best classification performance was achieved by RF with an accuracy of 97% and kappa coefficient of 0.94. This study demonstrates the effectiveness of synthetic data in enhancing classification accuracy.
