Abstract

Synthetic data generation is critical in machine and deep learning research for overcoming shortages of samples and small dataset sizes. Various algorithms, including generative adversarial network (GAN) and autoencoder models, have been applied to generate artificial datasets in previous studies. In this study, we propose a deep-learning-based synthetic data generation framework for tabular datasets collected from cognitive psychology behavioral experiments. Tabular datasets from the Stroop task were used to develop our framework. Because of the relatively small sample size (N=102) of the dataset used in our study, we used a pre-trained generative adversarial network model to complement the size of the dataset. Furthermore, we proposed and applied five evaluation methods with statistical tests (overlapped sample test, constraint reflection test, correlation reflection test, distribution distance test, and feature distance test) to validate generation performance at three levels of table structure (instance-level, feature-level, and whole-set-level evaluations). The proposed framework with a fine-tuned generative adversarial network was compared with a random generation method to verify generation performance, including how well the statistical characteristics of the original datasets were represented. We found that, across the five evaluation methods, the datasets generated by the proposed framework exhibited statistical characteristics more similar to those of the original dataset than the randomly generated datasets did. The results of this study provide not only generation algorithms for tabular cognitive psychological datasets but also a solution to the sample size issue for researchers.

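The paper's framework is not reproduced here, but the core generation step can be sketched with an off-the-shelf tabular GAN. The snippet below is a minimal illustration using the open-source CTGAN library as a stand-in for the pre-trained, fine-tuned generator described above; the file name and the column names (`reaction_time`, `accuracy`, `condition`) are hypothetical placeholders for Stroop-task features, not the study's actual schema.

```python
# Minimal sketch: fit a tabular GAN on a small behavioral dataset and
# sample synthetic rows. CTGAN stands in for the fine-tuned generator
# in the paper; the file and columns are hypothetical placeholders.
import pandas as pd
from ctgan import CTGAN

# Original (small) tabular dataset, e.g. N=102 participants with
# columns such as reaction_time, accuracy, condition
real = pd.read_csv("stroop_data.csv")   # hypothetical file name
discrete_columns = ["condition"]        # e.g. congruent / incongruent

model = CTGAN(epochs=300)
model.fit(real, discrete_columns)

# Complement the small sample with synthetic rows
synthetic = model.sample(1000)
print(synthetic.head())
```
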
Highlights

  • Sample or dataset size is considered a critical factor for various data analysis methodologies, including statistical and machine learning methods [1,2,3,4]

  • We propose a deep-learning-based synthetic data generation framework for tabular datasets collected from cognitive psychology behavioral experiments

  • We found that, across five evaluation methods, the datasets generated by the proposed framework exhibited statistical characteristics more similar to those of the original dataset than randomly generated datasets did (see the illustrative sketch after this list)

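To make the evaluation side concrete, the sketch below shows one plausible way to operationalize two of the five methods: a per-feature Kolmogorov-Smirnov statistic for the distribution distance test and a correlation-matrix difference for the correlation reflection test. These are generic stand-ins based on common practice, not the paper's exact test procedures, and the function names are ours.

```python
# Illustrative checks for two of the five evaluation methods:
# distribution distance (per-feature two-sample KS statistic) and
# correlation reflection (difference between correlation matrices).
# Generic stand-ins, not the paper's exact procedures.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def distribution_distance(real: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """Two-sample KS statistic for each shared numeric column."""
    cols = real.select_dtypes("number").columns.intersection(synth.columns)
    return pd.Series({c: ks_2samp(real[c], synth[c]).statistic for c in cols})

def correlation_reflection(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean absolute difference between the two correlation matrices."""
    cols = real.select_dtypes("number").columns.intersection(synth.columns)
    diff = real[cols].corr() - synth[cols].corr()
    return float(np.abs(diff.values).mean())
```
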
Introduction

Sample or dataset size is considered a critical factor for various data analysis methodologies, including statistical and machine learning methods [1,2,3,4]. In statistical analysis, many tests require an appropriate sample size to ensure the power and reliability of the results [5, 6]. Lachin et al. highlighted the importance of sample size determination and power analysis in clinical trials [7]. MacCallum et al. introduced a framework for determining the minimum sample size needed for adequate power in empirical behavioral research [8]. An adequate dataset size is likewise essential for machine and deep learning methodologies. Sun et al. suggested a relationship between dataset size and model performance in visual deep learning models [13].
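
As a concrete illustration of the power analysis these studies describe, the following sketch computes the minimum per-group sample size for an independent-samples t-test using statsmodels; the effect size, alpha, and power values are conventional assumptions, not figures from the cited work.

```python
# A priori sample size determination for an independent-samples t-test,
# illustrating the kind of power analysis discussed above.
# Effect size, alpha, and power are conventional assumed values.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # medium effect (Cohen's d)
                                   alpha=0.05,       # significance level
                                   power=0.80)       # desired statistical power
print(f"Required sample size per group: {n_per_group:.1f}")
```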
