When AI Agents Take Surveys: Protecting Data Integrity in Business and Marketing Research

Abstract

The increasing use of crowdsourcing platforms for behavioural research rests on the assumption that research participants are exclusively human. This assumption is now under threat. AI agents from browsers such as OpenAI’s Atlas and Perplexity’s Comet can autonomously complete online surveys. These agents can simulate specific personas or demographic profiles and follow survey prompts, select responses and submit data with fluency and internal consistency. Such capabilities threaten data authenticity and integrity, especially as subjective perception, motivation and emotion are central in behavioural research. This research note outlines practical mitigation strategies to detect AI responses. In addition to immediate measures, the emergence of AI-generated survey data requires broader methodological reflection, updated ethical guidelines and transparent reporting practices. We also situate these risks within the emerging literature on synthetic data, distinguishing unauthorised AI-generated responses from the transparent and theory-driven use of synthetic data for research purposes. Finally, we offer a forward-looking research agenda for protecting human data while responsibly engaging with synthetic data in marketing research. Instead of treating AI solely as a threat, researchers can use this as an opportunity to strengthen methodological rigour and protect the authenticity of human data in an increasingly automated research environment.

Similar Papers
  • Research Article
  • Cited by 24
  • 10.1007/s11263-024-02102-x
Synthetic Data for Video Surveillance Applications of Computer Vision: A Review
  • May 17, 2024
  • International Journal of Computer Vision
  • Rita Delussu + 2 more

In recent years, there has been a growing interest in synthetic data for several computer vision applications, such as automotive, detection and tracking, surveillance, medical image analysis and robotics. Early use of synthetic data was aimed at performing controlled experiments under the analysis by synthesis approach. Currently, synthetic data are mainly used for training computer vision models, especially deep learning ones, to address well-known issues of real data, such as manual annotation effort, data imbalance and bias, and privacy-related restrictions. In this work, we survey the use of synthetic training data focusing on applications related to video surveillance, whose relevance has rapidly increased in the past few years due to their connection to security: crowd counting, object and pedestrian detection and tracking, behaviour analysis, person re-identification and face recognition. Synthetic training data are even more interesting in this kind of application, to address further, specific issues arising, e.g., from typically unconstrained image or video acquisition conditions and cross-scene application scenarios. We categorise and discuss the existing methods for creating synthetic data, analyse the synthetic data sets proposed in the literature for each of the considered applications, and provide an overview of their effectiveness as training data. We finally discuss whether and to what extent the existing synthetic data sets mitigate the issues of real data, highlight existing open issues, and suggest future research directions in this field.

  • Research Article
  • Cited by 16
  • 10.1002/cpt.3001
A case for synthetic data in regulatory decision-making in Europe.
  • Aug 24, 2023
  • Clinical Pharmacology & Therapeutics
  • Clara Alloza + 12 more

Regulators are faced with many challenges surrounding health data usage, including privacy, fragmentation, validity, and generalizability, especially in the European Union (EU), for which synthetic data may provide innovative solutions. Synthetic data, defined as data artificially generated rather than captured in the real world, are increasingly being used for healthcare research purposes as a proxy to real-world data (RWD). Currently, barriers are particularly challenging in Europe, where sharing patients' data is strictly regulated, costly, and time consuming, causing delays in evidence generation and regulatory approvals. Recent initiatives are encouraging the use of synthetic data in regulatory decision-making and health technology assessment to overcome these challenges, but synthetic data have still to overcome realistic obstacles before their adoption by researchers and regulators in Europe. Thus, the emerging use of RWD and synthetic data by pharmaceutical and medical device industries calls on regulatory bodies to provide a framework for proper evidence generation and informed regulatory decision-making. As the provision of data becomes more ubiquitous in scientific research, so will innovations in artificial intelligence, machine learning, and generation of synthetic data, making the exploration and intricacies of this topic all the more important and timely. In this review, we discuss the potential merits and challenges of synthetic data in the context of decision-making in the European regulatory environment. We explore the current uses of synthetic data and ongoing initiatives, the value of synthetic data for regulatory purposes, and realistic barriers to the adoption of synthetic data in healthcare.

  • Research Article
  • 10.21203/rs.3.rs-8497559/v1
A novel pipeline for realistic synthetic longitudinal EHR data generation
  • Jan 29, 2026
  • Research Square
  • Gabrielle Josling + 2 more

Background: Synthetic health data offers a promising means of sharing clinical information without compromising patient privacy. However, existing methods often produce outputs that differ in structure from real data and are evaluated in narrow contexts, limiting their practical use in downstream analytical workflows. This study introduces a pipeline that builds upon existing methods for generating realistic synthetic longitudinal electronic health record data, evaluates it across three diverse datasets, and offers evidence-based guidance on the use of synthetic data to replace or augment real data. Methods: The pipeline extends the existing state-of-the-art HALO and ConSequence frameworks with a post-processing step that reconstructs continuous variables and timestamps, producing synthetic data that closely matches the structure of real medical record datasets. It was applied to three clinically diverse datasets: a small longitudinal cohort, a medium-sized intensive-care dataset, and a very large multi-hospital administrative dataset. Realism was assessed alongside utility for machine learning, statistical modelling, and time series analysis tasks. Results: Across all datasets, the pipeline generated realistic synthetic data that preserved key statistical properties and relationships. Machine learning models trained on synthetic data achieved similar predictive accuracy and feature importance patterns to those trained on real data, indicating strong utility. Synthetic data also performed well in statistical modelling, with the direction and magnitude of effects generally closely aligned with the real data. However, it may be less suitable when precise estimates are required or when modelling relatively rare conditions. Importantly, although the pipeline reconstructed timestamp structures, it did not capture aggregate temporal patterns and the resulting data was therefore unsuitable for time series analysis. Conclusions: The pipeline produces realistic and analytically useful synthetic longitudinal electronic health record data across datasets of widely varying scales. These findings provide practical guidance on when synthetic data can meaningfully substitute for or complement real data.

  • Research Article
  • Cited by 1
  • 10.3934/aci.2024009
Finnish perspective on using synthetic health data to protect privacy: the PRIVASA project
  • Jan 1, 2024
  • Applied Computing and Intelligence
  • Tinja Pitkämäki + 15 more

The use of synthetic data could facilitate data-driven innovation across industries and applications. Synthetic data can be generated using a range of methods, from statistical modeling to machine learning and generative AI, resulting in datasets of different formats and utility. In the health sector, the use of synthetic data is often motivated by privacy concerns. As generative AI is becoming an everyday tool, there is a need for practice-oriented insights into the prospects and limitations of synthetic data, especially in the privacy-sensitive domains. We present an interdisciplinary outlook on the topic, focusing on, but not limited to, the Finnish regulatory context. First, we emphasize the need for working definitions to avoid misplaced assumptions. Second, we consider use cases for synthetic data, viewing it as a helpful tool for experimentation, decision-making, and building data literacy. Yet the complementary uses of synthetic datasets should not diminish the continued efforts to collect and share high-quality real-world data. Third, we discuss how privacy-preserving synthetic datasets fall into the existing data protection frameworks. Neither the process of synthetic data generation nor synthetic datasets are automatically exempt from the regulatory obligations concerning personal data. Finally, we explore the future research directions for generating synthetic data and conclude by discussing potential future developments at the societal level.

  • Research Article
  • Cited by 3
  • 10.15593/2499-9873/2023.4.01
Review of methods and systems for generation of synthetic training data
  • Dec 15, 2023
  • Applied Mathematics and Control Sciences
  • A N Rabchevsky

It is impossible to imagine the advancement of modern artificial intelligence systems without neural network technologies. During the design process, researchers often find that there is not enough data to train modern neural network models, or that the data are unbalanced or highly sparse. Often real data simply do not exist, as the research field is still emerging. A related problem is ensuring the confidentiality of real personal or patient medical data that are exchanged between researchers or used to test various neural network systems. In many subject areas, the cost of collecting and labelling real data can be very high. Synthetic data is increasingly being used to solve these problems. The purpose of this publication is to introduce readers to advances in the generation and use of synthetic data. The paper presents a description of various methods, systems and software tools used to generate synthetic data, which can help to improve neural network models. Since an entire industry for synthetic data production has already formed, the leading data synthesis technology platforms are presented. The paper is of an overview nature, so it contains an extensive bibliography. The value of the article lies in the fact that this review will help readers broaden their understanding of the use of synthetic data in solving a wide range of neural network problems, as well as become more familiar with the methods and tools for their generation.

  • Conference Article
  • 10.54941/ahfe1006801
Data Synthetization and Feature Analysis: A Study in Bladder Cancer Recurrence Data
  • Jan 1, 2025
  • AHFE international
  • Sandi Baressi Šegota + 7 more

The application of synthetic data within the biomedical domain is rapidly gaining momentum, driven by the growing need for robust datasets suitable for machine learning (ML) and statistical modeling. In scenarios where access to real patient data is limited due to privacy concerns or scarcity, synthetic data offers an attractive alternative. These artificially generated datasets aim to mimic the statistical characteristics of original data, enabling researchers to conduct exploratory analysis, develop predictive models, or validate findings without compromising patient confidentiality. However, the increasing use of synthetic data raises several methodological and interpretative challenges, particularly regarding the correct sequence and context for applying statistical analyses. One of the central issues identified in contemporary literature concerns the timing of data analysis relative to the synthetic data generation process. Some studies conduct statistical or ML analyses directly on real datasets and use synthetic data for validation or augmentation. Others, conversely, perform all stages of analysis, including feature importance estimation, correlation assessment, and model training, on synthetic data. This inconsistency raises the question of whether statistical analysis conducted solely on synthetic datasets yields reliable insights, or whether it constitutes a methodological flaw. The prevailing assumption is that analysis should ideally be performed on real data to preserve statistical integrity, but empirical evaluation of this notion remains limited. In the current study, the authors address this issue by applying a synthetic data generation method, specifically the Tabular Variational Autoencoder (TVAE), to a biomedical dataset focused on bladder cancer recurrence. This dataset includes various diagnostic variables, and the primary goal is to assess how well synthetic data replicates analytical insights drawn from the original data. To achieve this, the authors conduct both correlational analysis and machine learning-based feature importance estimation. The results derived from synthetic datasets of varying sizes are then compared to those obtained from the original data. The findings indicate that while synthetic data can approximate general trends observed in the original dataset, there are notable differences depending on the analytical technique employed. In particular, models such as Random Forest appear more sensitive to variations introduced during the synthetization process. This sensitivity manifests as shifts in feature importance rankings and variability in predictive performance, especially when working with smaller synthetic datasets. On the other hand, simpler statistical methods such as correlation coefficients display more stability, suggesting that some analytical approaches may be more robust to data generation artifacts than others. These observations underscore the importance of methodological caution when interpreting results based on synthetic biomedical data. While synthetic datasets hold considerable promise for advancing data-driven research in biomedicine, they are not a one-size-fits-all solution. The sequence in which synthetic data is introduced into the research pipeline, whether before or after statistical analysis, can significantly influence the validity of the findings. As such, researchers must critically assess the suitability of synthetic data for specific analytical tasks and ensure transparency in reporting their methodological choices. Future work should further explore the impact of different generative models and dataset properties on the reliability of synthetic-data-driven insights.

  • Abstract
  • Cited by 1
  • 10.23889/ijpds.v9i5.2766
SynD: Australian synthetic health data community of practice
  • Sep 10, 2024
  • International Journal of Population Data Science
  • Ben Hachey + 10 more

Objectives: The current workflow for health data research in Australia is inefficient. After funding is secured, researchers often face delays of months or years to access the necessary data. Synthetic data could significantly improve the pace and impact of health data research but lacks foundational infrastructure. We aim to develop this infrastructure and support the use of synthetic data to improve data access and research quality across Australia. Approach: We held two workshops with Australian groups working on synthetic data. The format included participant updates and invited talks on international approaches to synthetic data and health data research. Workshops collected use cases and stimulated discussion on national collaboration. A facilitator then led thematic analysis to draft a consensus roadmap and terms of reference towards national synthetic data infrastructure. Results: We recruited 18 participants. Participants were cross-sectoral: universities (9), research funding bodies (5), state health departments (4). They represented six states and territories: Queensland (6), New South Wales (3), Victoria (3), Western Australia (3), Australian Capital Territory (2), South Australia (1). Gender: women (11), men (7). The roadmap includes stakeholder engagement, a governance framework, and training events. Conclusion: SynD is an Australian community of practice for synthetic health data. Our mission is to unlock the value of health information through synthetic data to advance research, education, innovation and service delivery within the health and care sector. This collaborative effort should ensure a harmonised approach to the safe and effective utilisation of synthetic data to enhance health outcomes across Australia.

  • Research Article
  • Cited by 8
  • 10.1177/20539517251318289
The ontological politics of synthetic data: Normalities, outliers, and intersectional hallucinations
  • Apr 13, 2025
  • Big Data & Society
  • Francis Lee + 2 more

Synthetic data is increasingly used as a substitute for real data due to ethical, legal, and logistical reasons. However, the rise of synthetic data also raises critical questions about its entanglement with the politics of classification and the reproduction of social norms and categories. This paper aims to problematize the use of synthetic data by examining how its production is intertwined with the maintenance of certain worldviews and classifications. We argue that synthetic data, like real data, is embedded with societal biases and power structures, leading to the reproduction of existing social inequalities. Through empirical examples, we demonstrate how synthetic data tends to highlight majority elements as the “normal” and minimize minority elements, and that the slight changes to the data structures that create synthetic data will also inevitably result in what we term “intersectional hallucinations.” These hallucinations are inherent to synthetic data and cannot be entirely eliminated without compromising the purpose of creating synthetic datasets. We contend that decisions about synthetic data involve determining which intersections are essential and which can be disregarded, a practice which will imbue these decisions with norms and values. Our study underscores the need for critical engagement with the mathematical and statistical choices in synthetic data production and advocates for careful consideration of the ontological and political implications of these choices during curatorial style production of synthetic structured data.

  • Research Article
  • Cited by 12
  • 10.1016/j.isci.2025.112382
On the fidelity versus privacy and utility trade-off of synthetic patient data
  • Apr 14, 2025
  • iScience
  • Tim Adams + 8 more

Summary: The use of synthetic data is a widely discussed and promising solution for privacy-preserving medical research. Synthetic data may, however, not always rule out the risk of re-identifying characteristics of real patients and can vary greatly in terms of data fidelity and utility. We systematically evaluate the trade-offs between privacy, fidelity, and utility across five synthetic data models and three patient-level datasets. We evaluate fidelity based on statistical similarity to the real data, utility on three machine learning use cases, and privacy via membership inference, singling out, and attribute inference risks. Synthetic data without differential privacy (DP) maintained fidelity and utility without evident privacy breaches, whereas DP-enforced models significantly disrupted correlation structures. K-anonymity-based data sanitization of demographic features, while preserving fidelity, introduced notable privacy risks. Our findings emphasize the need to advance methods that effectively balance privacy, fidelity, and utility in synthetic patient data generation.

  • Conference Article
  • Cited by 2
  • 10.54941/ahfe1005071
Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations
  • Jan 1, 2024
  • AHFE international
  • Martin Thissen + 7 more

The examination of the musculoskeletal system in dogs is a challenging task in veterinary practice. The careful diagnosis as well as the evaluation of very complex findings is getting increasingly important. Therefore, a novel method has been developed that enables efficient documentation of a dog's condition through a visual representation. However, since the visual documentation is new, there is no existing training data. The objective of this work is therefore to mitigate the impact of data scarcity in order to develop an AI-based diagnostic support system that can provide veterinarians with accurate predictions. To this end, the potential of synthetic data that mimics realistic visual documentations of diseases for pre-training AI models is investigated. Specifically, this work explores whether pre-training an AI model with synthetic data can improve the overall accuracy of canine musculoskeletal diagnoses. We propose a method for generating synthetic image data that mimics realistic visual documentations. Initially, a basic dataset containing three distinct classes is generated, followed by the creation of a more sophisticated dataset containing 36 different classes. Both datasets are used for the pre-training of an AI model, adapting it to the domain of visual documentations. Subsequently, an evaluation dataset is created, consisting of 250 manually created visual documentations for five different diseases. This dataset, along with a subset containing 25 examples, serves as the basis for evaluating the efficacy of pre-training an AI model on synthetic data. The obtained results on the evaluation dataset containing 25 examples demonstrate a significant enhancement of approximately 10% in diagnosis accuracy when utilizing generated synthetic images that mimic real-world visual documentations. However, these results do not hold true for the larger evaluation dataset containing 250 examples, indicating that the advantages of using synthetic data for pre-training an AI model emerge primarily when dealing with few examples of visual documentations for a given disease. This implies that the use of synthetic data may not be necessary for diseases with many visual documentation examples. Overall, this work provides valuable insights into mitigating the limitations imposed by limited training data through the strategic use of generated synthetic data, presenting an approach applicable beyond the canine musculoskeletal assessment domain.

  • Research Article
  • Cited by 5
  • 10.1123/ijspp.2023-0007
Synthetic Data as a Strategy to Resolve Data Privacy and Confidentiality Concerns in the Sport Sciences: Practical Examples and an R Shiny Application.
  • Oct 1, 2023
  • International Journal of Sports Physiology and Performance
  • Mitchell Naughton + 3 more

There has been a proliferation in technologies in the sport performance environment that collect increasingly larger quantities of athlete data. These data have the potential to be personal, sensitive, and revealing and raise privacy and confidentiality concerns. A solution may be the use of synthetic data, which mimic the properties of the original data. The aim of this study was to provide examples of synthetic data generation to demonstrate its practical use and to deploy a freely available web-based R Shiny application to generate synthetic data. Openly available data from 2 previously published studies were obtained, representing typical data sets of (1) field- and gym-based team-sport external and internal load during a preseason period (n = 28) and (2) performance and subjective changes from before to after the posttraining intervention (n = 22). Synthetic data were generated using the synthpop package in R Studio software, and comparisons between the original and synthetic data sets were made through Welch t tests and the distributional similarity standardized propensity mean squared error statistic. There were no significant differences between the original and synthetic data sets across all variables examined in both data sets (P > .05). Further, there was distributional similarity (ie, low standardized propensity mean squared error) between the original observed and synthetic data sets. These findings highlight the potential use of synthetic data as a practical solution to privacy and confidentiality issues. Synthetic data can unlock previously inaccessible data sets for exploratory analysis and facilitate multiteam or multicenter collaborations. Interested sport scientists, practitioners, and researchers should consider utilizing the shiny web application (SYNTHETIC DATA-available at https://assetlab.shinyapps.io/SyntheticData/).

  • Research Article
  • Cited by 19
  • 10.1109/access.2022.3156073
Fine Grain Synthetic Educational Data: Challenges and Limitations of Collaborative Learning Analytics
  • Jan 1, 2022
  • IEEE Access
  • Brendan Flanagan + 2 more

While data privacy is a key aspect of Learning Analytics (LA), it often creates difficulty when promoting research into underexplored contexts as it limits data sharing. To overcome this problem, the generation of synthetic data has been proposed and discussed within the LA community. However, there has been little work that has explored the use of synthetic data in real-world situations. This research examines the effectiveness of using synthetic data for training academic performance prediction models, and the challenges and limitations of using the proposed data sharing method. To evaluate the effectiveness of the method, we generate synthetic data from a private dataset, and distribute it to the participants of a data challenge to train prediction models. Participants submitted their models as docker containers for evaluation and ranking on holdout synthetic data. A post-hoc analysis was conducted on the top 10 participants’ models by comparing the evaluation of their performance on synthetic and private validation datasets. Several models trained on synthetic data were found to perform significantly worse when applied to the non-synthetic private dataset. The main contribution of this research is to understand the challenges and limitations of applying predictive models trained on synthetic data in real-world situations. Due to these challenges, the paper recommends model designs that can inform future successful adoption of synthetic data in real-world educational data systems.

  • Book Chapter
  • Cited by 1
  • 10.4018/979-8-3693-1886-7.ch010
The Privacy-Preserving High-Dimensional Synthetic Data Generation and Evaluation in the Healthcare Domain
  • Apr 19, 2024
  • Chandrakant Mallick + 2 more

In the fast-changing environment of healthcare research and technology, there is an increasing demand for varied and vast information. However, issues with data privacy, unavailability, and ethical considerations frequently limit smooth access to true high-dimensional healthcare data. This research investigates a viable approach to addressing these challenges: the use of high-dimensional synthetic data in the healthcare area. The authors investigate the potentials and uses of synthetic data production through a review of current literature and methodology, providing insights into its role in overcoming data access barriers, fostering innovation, and supporting evidence-based decision making. The chapter outlines significant use cases, such as simulation and prediction research, hypothesis and algorithm testing, epidemiology, health information technology development, teaching and training, public dataset release, and data connecting.

  • Conference Article
  • Cited by 3
  • 10.1117/12.321862
Upper bound calculations of ATR performance for ladar sensors
  • Sep 15, 1998
  • Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
  • Vince E Diehl + 2 more

The use of robust and representative synthetic imagery data to test and evaluate automatic target recognition (ATR) systems has long been desired but generally considered beyond the current state of the art. The use of synthetic data is investigated here to calculate upper bounds on potential ATR system performance. This paper presents the use of synthetically generated imagery templates as a means of developing upper bounds of ATR performance for laser radar based seekers. This approach employs a synthetic scene generation capability and integrates it with error models that represent decrements in performance due to resolution, noise and geometric distortion resulting from the sensing process. This paper describes the modeling approach taken and presents preliminary results. The model is currently undergoing testing against real imagery and is being used to select test sets to more effectively evaluate ATRs.

  • Research Article
  • Cited by 61
  • 10.1148/radiol.232471
Generating Synthetic Data for Medical Imaging.
  • Sep 1, 2024
  • Radiology
  • Lennart R Koetzier + 10 more

Artificial intelligence (AI) models for medical imaging tasks, such as classification or segmentation, require large and diverse datasets of images. However, due to privacy and ethical issues, as well as data sharing infrastructure barriers, these datasets are scarce and difficult to assemble. Synthetic medical imaging data generated by AI from existing data could address this challenge by augmenting and anonymizing real imaging data. In addition, synthetic data enable new applications, including modality translation, contrast synthesis, and professional training for radiologists. However, the use of synthetic data also poses technical and ethical challenges. These challenges include ensuring the realism and diversity of the synthesized images while keeping data unidentifiable, evaluating the performance and generalizability of models trained on synthetic data, and high computational costs. Since existing regulations are not sufficient to guarantee the safe and ethical use of synthetic images, it becomes evident that updated laws and more rigorous oversight are needed. Regulatory bodies, physicians, and AI developers should collaborate to develop, maintain, and continually refine best practices for synthetic data. This review aims to provide an overview of the current knowledge of synthetic data in medical imaging and highlights current key challenges in the field to guide future research and development.
