Towards Synthetic Trace Generation of Modeling Operations using In-Context Learning Approach
Producing accurate software models is crucial in model-driven software engineering (MDE). However, modeling complex systems is an error-prone task that requires deep application domain knowledge. Over the past decade, several automated techniques have been proposed to support academic and industrial practitioners by suggesting relevant modeling operations. Nevertheless, those techniques require a large amount of training data, which may not be available due to several factors, e.g., privacy issues. The advent of large language models (LLMs) can support the generation of synthetic data, although state-of-the-art approaches do not yet support the generation of modeling operations. To fill this gap, we propose a conceptual framework that combines modeling event logs, intelligent modeling assistants (IMAs), and the generation of modeling operations using LLMs. In particular, the architecture comprises modeling components that help the designer specify the system, record modeling operations within a graphical modeling environment, and automatically recommend relevant operations. In addition, we generate a completely new dataset of modeling events by relying on the most prominent LLMs currently available. As a proof of concept, we instantiate the proposed framework using a set of existing modeling tools employed in industrial use cases within different European projects. To assess the proposed methodology, we first evaluate the capability of the examined LLMs to generate realistic modeling operations by relying on well-founded distance metrics. Then, we evaluate the recommended operations against real-world industrial modeling artifacts. Our findings demonstrate that LLMs can generate modeling events, even though the overall accuracy is higher when considering human-based operations. In this respect, we see generative AI tools as an alternative when modeling operations are not available to train traditional IMAs specifically conceived to support industrial practitioners.
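A minimal sketch of the in-context learning step described above, assuming an OpenAI-style chat API; the event schema, few-shot log, and model name are illustrative placeholders rather than the paper's actual prompt or log format:

```python
# Hypothetical sketch: prompt an LLM with a seed log of modeling events
# and parse the JSON array it returns. The schema and examples below are
# assumptions for illustration, not the paper's actual setup.
import json
from openai import OpenAI  # pip install openai

FEW_SHOT_EVENTS = [
    {"op": "addClass", "target": "Order", "timestamp": 1},
    {"op": "addAttribute", "target": "Order.totalPrice", "timestamp": 2},
    {"op": "addReference", "target": "Order->Customer", "timestamp": 3},
]

def generate_synthetic_events(domain: str, n_events: int = 10) -> list:
    """Prompt the LLM with a seed log and parse the JSON array it returns."""
    prompt = (
        "You generate synthetic modeling-operation logs.\n"
        "Each event is a JSON object with keys op, target, timestamp.\n"
        f"Example log:\n{json.dumps(FEW_SHOT_EVENTS, indent=2)}\n"
        f"Produce {n_events} plausible events for a {domain} metamodel "
        "as a JSON array, and output nothing else."
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model applies
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.choices[0].message.content)

if __name__ == "__main__":
    print(generate_synthetic_events("e-commerce", 5))
```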
- Research Article
- 10.1164/ajrccm.2025.211.abstracts.a5686
- May 1, 2025
- American Journal of Respiratory and Critical Care Medicine
Rationale: The scarcity of annotated training data presents a significant challenge for developing robust deep learning (DL) models for practical use in healthcare. We explore a novel synthetic data generation pipeline to augment limited real-world datasets for training DL networks in pulmonology. By augmenting existing datasets with a wide variety of disease presentations, we address the problem of biased training data distributions that do not contain all representations of disease presentations, especially severe cases that may occur infrequently. This research explores data augmentation for predicting Radiographic Assessment of Lung Edema (RALE) scores of chest X-ray (CXR) opacities in respiratory illnesses. Methods: Our methodology consists of several key components: establishing a healthy CXR dataset through real data or generative AI, lung segmentation, quadrant vertex location identification, synthetic noise generation using Perlin noise, quadrant-wise opacity analysis, and RALE score computation. Our approach begins with collecting healthy CXRs and their corresponding lung segmentation masks. We then apply a parameterized noise generation algorithm to induce realistic edema-like patterns within the lung fields. Synthetic illness is represented at all levels of density opacification. Once generated, computer vision (CV) approaches are applied to the synthetic opacity patterns to compute RALE scores by analyzing density and extent within each lung quadrant, following established clinical criteria [Warren, MA et al., 2018]. This approach enables the generation of a large-scale synthetic dataset with known ground-truth RALE scores. Results: For our Siamese CNN baseline we found good inter-rater agreement (IRA) between physician and predicted RALE scores (ICC = 0.74, 95% confidence interval [0.71, 0.77], p < 0.001; MSE = 80.98). When synthetically generated CXRs were added, the IRA slightly worsened while MSE improved (ICC = 0.65, [0.60, 0.70], p < 0.001; MSE = 65.41). Conclusion: This research shows that augmenting limited real-world CXR datasets with synthetically generated CXRs containing varying degrees of edema-like opacities for DL training may improve results. By combining the synthetic data with real annotated cases for model training while maintaining independent real-world data for testing, we aim to improve the robustness and generalizability of deep learning models for lung edema assessment. This framework not only addresses the immediate challenge of limited training data for RALE score prediction but also establishes a generalizable approach to synthetic medical image generation that could extend to other radiographic scoring systems and pathologies.
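A rough sketch of the pipeline's two core steps, generating fractal (Perlin-like) noise inside a lung mask and computing a quadrant-wise RALE score; the noise construction and the extent/density thresholds are assumptions, not the paper's calibrated parameters:

```python
# Illustrative sketch, not the paper's implementation: Perlin-like fractal
# noise is injected inside a lung mask, then a quadrant-wise RALE score
# (extent 1-4 times density 1-3 per opacified quadrant, max 48) is computed.
import numpy as np

def fractal_noise(shape, octaves=4, seed=0):
    """Perlin-like noise: sum of bilinearly upsampled random grids."""
    rng = np.random.default_rng(seed)
    out = np.zeros(shape)
    for o in range(octaves):
        step = 2 ** (octaves - o)
        coarse = rng.random((shape[0] // step + 2, shape[1] // step + 2))
        ys = np.linspace(0, coarse.shape[0] - 1.001, shape[0])
        xs = np.linspace(0, coarse.shape[1] - 1.001, shape[1])
        y0, x0 = ys.astype(int), xs.astype(int)
        fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]
        out += (coarse[np.ix_(y0, x0)] * (1 - fy) * (1 - fx)
                + coarse[np.ix_(y0 + 1, x0)] * fy * (1 - fx)
                + coarse[np.ix_(y0, x0 + 1)] * (1 - fy) * fx
                + coarse[np.ix_(y0 + 1, x0 + 1)] * fy * fx) / 2 ** o
    return (out - out.min()) / (np.ptp(out) + 1e-9)

def rale_score(opacity, lung_mask, opaque=0.5):
    """Sum over the four quadrants of extent (1-4) x density (1-3)."""
    h, w = opacity.shape
    total = 0
    for ys in (slice(0, h // 2), slice(h // 2, h)):
        for xs in (slice(0, w // 2), slice(w // 2, w)):
            quad = opacity[ys, xs][lung_mask[ys, xs].astype(bool)]
            if quad.size == 0 or (quad > opaque).mean() == 0:
                continue
            frac = (quad > opaque).mean()               # opacified fraction
            extent = 1 + np.searchsorted([0.25, 0.5, 0.75], frac)
            density = 1 + np.searchsorted([0.65, 0.85], quad[quad > opaque].mean())
            total += int(extent * density)
    return total  # 0 (clear lungs) .. 48 (dense opacity everywhere)

mask = np.zeros((256, 256)); mask[64:192, 32:224] = 1   # toy lung mask
synthetic_opacity = fractal_noise((256, 256)) * mask
print("synthetic RALE score:", rale_score(synthetic_opacity, mask))
```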
- Conference Article
37
- 10.1109/smartgridcomm.2017.8340657
- Oct 1, 2017
Well-annotated power consumption traces are a crucial prerequisite for the development and analysis of load disaggregation algorithms. Due to the high effort required to collect such traces in the real world, their synthetic generation has emerged as a viable alternative. However, many current models for synthetic trace generation simply combine statistical information about household occupancy with the energy consumption of the most frequently performed user activities. While this may suffice for high-level analyses (i.e., considering groups of households or entire cities), such models do not reflect the actual diversity of consumption signatures in real data. We overcome this limitation in this paper by presenting a system design that models appliance power consumption at a user-definable accuracy. Our Automated Model Builder for Appliance Loads (AMBAL) derives models from real device power consumption data collected by means of smart plugs. These models are represented by sequences of parameterized signatures; each model's complexity is minimized for its desired level of accuracy. We evaluate the accuracy of AMBAL's models on device traces with consumption patterns of varying complexity, taken from existing appliance-level data sets. Moreover, we present a synthetic appliance trace generator that recombines appliance models to simulate user activities in homes with a definable complexity. The generated data is valuable for the development of data analysis algorithms (e.g., Non-Intrusive Load Monitoring), and we integrate it with the NILMTK framework to demonstrate that similar disaggregation performance is achieved for actual and generated traces.
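A toy illustration of the underlying idea, fitting a power trace with a minimal sequence of constant-level segments under a user-defined error bound; the greedy segmentation and parameter names are assumptions, not AMBAL's actual algorithm:

```python
# Illustrative sketch: approximate an appliance power trace by the fewest
# constant-level segments whose RMSE stays below a user-chosen bound, then
# regenerate a synthetic trace from that segment "model".
import numpy as np

def segment_trace(power, max_rmse=25.0):
    """Greedily grow segments; start a new one when RMSE would exceed bound."""
    segments, start = [], 0
    for end in range(1, len(power) + 1):
        window = power[start:end]
        if np.sqrt(np.mean((window - window.mean()) ** 2)) > max_rmse:
            segments.append((start, end - 1, float(power[start:end - 1].mean())))
            start = end - 1
    segments.append((start, len(power), float(power[start:].mean())))
    return segments  # (begin, end, level) triples = the appliance model

def synthesize(segments):
    """Regenerate a synthetic trace from the segment model."""
    return np.concatenate([np.full(e - b, lvl) for b, e, lvl in segments])

# toy kettle-like trace: off, heating plateau with noise, off
trace = np.concatenate([np.zeros(50),
                        2000 + 10 * np.random.randn(120),
                        np.zeros(30)])
model = segment_trace(trace)
print(len(model), "segments; reconstruction length", len(synthesize(model)))
```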
- Conference Article
6
- 10.1109/ispass.2003.1190235
- Mar 6, 2003
We propose a new synthetic trace generation methodology for web server performance benchmarking. We discuss the two primary existing approaches to synthetic trace generation in terms of queueing models. We propose a new methodology as the natural combination of these two approaches. This hybrid approach permits natural modeling of client content and server session caching as well as a natural correspondence between the workload model parameters and the statistics of the resulting synthetic trace.
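A toy sketch of such a session-based workload generator, assuming Poisson session arrivals, exponential think times, and heavy-tailed object popularity; all parameter values are placeholders rather than calibrated workload statistics:

```python
# Illustrative session-based trace generator: sessions arrive as a Poisson
# process, each issues a random number of requests separated by exponential
# think times, and requested objects follow a heavy-tailed popularity law.
import random

def generate_trace(n_sessions=100, mean_requests=8, rate=2.0,
                   think=1.5, n_objects=1000, zipf_a=1.2, seed=7):
    rng = random.Random(seed)
    trace, t = [], 0.0
    for session in range(n_sessions):
        t += rng.expovariate(rate)                # session inter-arrival
        req_time = t
        for _ in range(max(1, int(rng.expovariate(1 / mean_requests)))):
            obj = min(n_objects, int(rng.paretovariate(zipf_a)))
            trace.append((round(req_time, 3), session, f"/obj/{obj}"))
            req_time += rng.expovariate(1 / think)  # client think time
    return sorted(trace)  # (timestamp, session_id, URL) records

for line in generate_trace(n_sessions=3):
    print(line)
```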
- Conference Article
87
- 10.1109/ispass.2000.842273
- Apr 24, 2000
Most research in the area of microarchitectural performance analysis is done using trace-driven simulations. Although trace-driven simulations are fairly accurate, they are both time- and space-consuming, which sometimes makes them impractical. Modeling the execution of a computer program by a statistical profile and generating a synthetic benchmark trace from this profile can be used to accelerate the design process. Thanks to the statistical nature of this technique, performance characteristics quickly converge to a steady-state solution during simulation, which makes the technique suitable for fast design space explorations. In this paper, it is shown how more detailed statistical profiles can be obtained and how the synthetic trace generation mechanism should be designed to generate syntactically correct benchmark traces. As a result, the performance predictions in this paper are far more accurate than those reported in previous research.
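A highly condensed sketch of profile-based synthesis, reducing the statistical profile to a first-order instruction mix; real profiles also capture dependency distances, branch behavior, and cache statistics:

```python
# Illustrative two-step scheme: (1) gather a statistical profile (here just
# instruction-mix frequencies) from a real trace; (2) sample a synthetic
# trace from it. The tiny trace below is a placeholder.
import random
from collections import Counter

real_trace = ["load", "add", "branch", "load", "store", "add", "branch", "mul"]

# 1. statistical profile: instruction-mix frequencies
profile = Counter(real_trace)
total = sum(profile.values())
ops, weights = zip(*[(op, c / total) for op, c in profile.items()])

# 2. synthetic generation: sample operations from the profile
rng = random.Random(42)
synthetic = rng.choices(ops, weights=weights, k=20)
print(synthetic)
```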
- Conference Article
- 10.1109/icmla.2017.00-86
- Dec 1, 2017
We present a novel approach for the accurate characterization of workloads, which is relevant in the context of complex big data applications. Workloads are generally described with statistical models based on the analysis of resource request measurements of a running program. In this paper we propose to treat the sequence of virtual memory references generated by a program during its execution as a time series, and to use spectral analysis principles to process the sequence. However, the sequence is time-varying, so we employ processing approaches based on Ergodic Continuous Hidden Markov Models (ECHMMs), which extend conventional stationary spectral analysis approaches to time-varying sequences. In this work, we describe two applications of the proposed approach: the on-line classification of a running process and the generation of synthetic traces of a given workload. The first step was to show that ECHMMs accurately describe virtual memory sequences; to this end, a different ECHMM was trained for each sequence, and the resulting run-time average process classification accuracy, evaluated using trace-driven simulations over a wide range of SPEC2000 traces, was about 82%. Then, a single ECHMM was trained using all the sequences obtained from a given running application; the classification accuracy, evaluated on the same traces, was about 76%. As regards synthetic trace generation, a single ECHMM characterizing a given application was used as a stochastic generator to produce benchmarks spanning a large application space.
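A sketch of the same usage pattern with hmmlearn, a standard HMM library whose Gaussian HMM is ergodic (fully connected) by default; the random-walk series below stand in for real SPEC2000 virtual-memory traces:

```python
# Illustrative HMM-based workload characterization: train one model per
# workload, classify a test sequence by likelihood, and use a trained
# model as a stochastic generator of synthetic traces.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

rng = np.random.default_rng(0)
workload_a = np.cumsum(rng.normal(1.0, 0.5, 2000)).reshape(-1, 1)
workload_b = np.cumsum(rng.normal(3.0, 2.0, 2000)).reshape(-1, 1)

models = {}
for name, seq in [("A", workload_a), ("B", workload_b)]:
    m = GaussianHMM(n_components=4, covariance_type="diag",
                    n_iter=30, random_state=0)
    m.fit(np.diff(seq, axis=0))               # model the reference deltas
    models[name] = m

# on-line classification: the highest-likelihood model wins
test = np.diff(workload_b[:500], axis=0)
print("classified as:", max(models, key=lambda k: models[k].score(test)))

# synthetic trace generation: the trained HMM as a stochastic generator
deltas, _ = models["A"].sample(1000)
synthetic_trace = np.cumsum(deltas)
```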
- Research Article
21
- 10.1016/j.jnca.2024.103926
- Jun 20, 2024
- Journal of Network and Computer Applications
Network Intrusion Detection Systems (NIDS) are crucial tools for protecting networked devices from cyberattacks. Recent developments in the field of Artificial Intelligence (AI) have provided tremendous advantages in implementing NIDSs able to monitor network traffic and block cyberattacks in real time. In the literature, it is widely recognized that the effective training of a NIDS requires a large quantity of labeled traffic representative of attacks. Nonetheless, the availability of public and abundant datasets remains remarkably restricted due to the cost of gathering and labeling real traffic traces and privacy concerns around sharing them. To tackle these challenges, in this paper we present a generative AI model capable of synthesizing anonymized traffic traces from real ones, thus addressing privacy, abundance, and representativeness. The proposal is based on a Conditional Variational Autoencoder (CVAE) and a preprocessing procedure specifically designed for the generation of new traffic traces. To validate our solution, we conduct an extensive empirical study leveraging three recent, publicly available datasets containing benign and malicious traffic. The validation is carried out from the perspectives of both the classification performance of a robust NIDS and the quality of the synthetic data, in comparison to the utilization of real data. We compare our CVAE with two state-of-the-art AI-based traffic data generators and prove that, trained with traces emitted by our generative model, a NIDS suffers only a limited F1-score loss compared to training on real data; competing models instead struggle or fail to generate traces that are as effective for NIDS training and as statistically similar to the original. We make the synthetic datasets available in both PCAP and tabular formats, to facilitate the reproducibility of our findings and encourage further exploration in the field of generative AI for networking.
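A minimal conditional VAE sketch in PyTorch, assuming each trace has been preprocessed into a fixed-length feature vector conditioned on a benign/attack label; dimensions and architecture are illustrative, not the paper's:

```python
# Illustrative CVAE: encode a trace-feature vector together with its class
# label, reparameterize, decode conditioned on the label; after training,
# sample label-conditioned synthetic traces from the decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATS, LABELS, LATENT = 64, 2, 16  # assumed dimensions

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(FEATS + LABELS, 128), nn.ReLU())
        self.mu, self.logvar = nn.Linear(128, LATENT), nn.Linear(128, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT + LABELS, 128), nn.ReLU(),
                                 nn.Linear(128, FEATS))

    def forward(self, x, y):
        h = self.enc(torch.cat([x, y], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        return self.dec(torch.cat([z, y], dim=1)), mu, logvar

def loss_fn(recon, x, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")                  # fidelity
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# after training, sample label-conditioned synthetic traces:
model = CVAE()
y = F.one_hot(torch.tensor([1]), LABELS).float()   # "attack" class
z = torch.randn(1, LATENT)
synthetic = model.dec(torch.cat([z, y], dim=1))    # one synthetic vector
```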
- Research Article
40
- 10.1145/3639037
- Feb 16, 2024
- Proceedings of the ACM on Measurement and Analysis of Computing Systems
Datasets of labeled network traces are essential for a multitude of machine learning (ML) tasks in networking, yet their availability is hindered by privacy and maintenance concerns, such as data staleness. To overcome this limitation, synthetic network traces can often augment existing datasets. Unfortunately, current synthetic trace generation methods, which typically produce only aggregated flow statistics or a few selected packet attributes, do not always suffice, especially when model training relies on having features that are only available from packet traces. This shortfall manifests in both insufficient statistical resemblance to real traces and suboptimal performance on ML tasks when employed for data augmentation. In this paper, we apply diffusion models to generate high-resolution synthetic network traffic traces. We present NetDiffusion, a tool that uses a finely-tuned, controlled variant of a Stable Diffusion model to generate synthetic network traffic that is high fidelity and conforms to protocol specifications. Our evaluation demonstrates that packet captures generated from NetDiffusion can achieve higher statistical similarity to real data and improved ML model performance than current state-of-the-art approaches (e.g., GAN-based approaches). Furthermore, our synthetic traces are compatible with common network analysis tools and support a myriad of network tasks, suggesting that NetDiffusion can serve a broader spectrum of network analysis and testing tasks, extending beyond ML-centric applications.
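NetDiffusion's actual pipeline fine-tunes a Stable Diffusion variant on nPrint-style packet encodings; the sketch below shows only the representational step of turning a packet flow into a fixed-size, image-like array, using a simplified bit-matrix encoding as an assumption:

```python
# Illustrative flow-to-image encoding: each packet's leading bytes become
# one row of bits, and the flow is padded to a fixed number of rows so an
# image-style generative model can consume it. Simplified stand-in for
# the nPrint features used by the real tool.
import numpy as np

def packets_to_matrix(packets, width_bytes=128, max_packets=64):
    """Encode each packet's first bytes as a row of bits."""
    rows = []
    for pkt in packets[:max_packets]:
        padded = pkt[:width_bytes].ljust(width_bytes, b"\x00")
        bits = np.unpackbits(np.frombuffer(padded, dtype=np.uint8))
        rows.append(bits)
    while len(rows) < max_packets:                 # pad short flows
        rows.append(np.zeros(width_bytes * 8, dtype=np.uint8))
    return np.stack(rows)                          # shape: (64, 1024)

fake_flow = [bytes([0x45, 0x00, 0x00, 0x3c]) + b"\x11" * 60 for _ in range(5)]
img = packets_to_matrix(fake_flow)
print(img.shape, img.dtype)  # (64, 1024) uint8, ready for an image model
```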
- Book Chapter
- 10.1016/b978-0-12-817236-0.00011-x
- Jan 1, 2021
- Applied Techniques to Integrated Oil and Gas Reservoir Characterization
Chapter 11 - Wavelet extraction/derivation
- Conference Article
7
- 10.1109/dsaa.2019.00052
- Oct 1, 2019
Spoken language can include sensitive topics, including profanity, insults, and political and offensive speech. In order to engage in contextually appropriate conversations, it is essential for voice services such as Alexa, Google Assistant, and Siri to detect sensitive topics in conversations and react appropriately. A simple approach to detecting sensitive topics is to use regular-expression or keyword-based rules. However, keyword-based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) rules do not scale and generalize well even for minor variations of the keywords. Machine learning (ML) approaches offer the potential benefit of generalization, but require large volumes of training data, which is difficult to obtain for sparse data problems. This paper describes: (1) an ML-based solution that uses training data (a 2.1M-example dataset), obtained from synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (>90%). Moreover, despite relying on synthetic training data, the ML models are able to generalize beyond the training data to identify significantly more (~2x for Logistic Regression, and ~4x-6x for neural network models such as Bi-LSTM and CNN) of the test stream as sensitive in comparison to a baseline approach that uses the training data (~1 million examples) as rules. We are able to train our models with very few manual annotations. The share of sensitive examples in our training dataset from synthetic generation using templates and from manual annotations is 98.04% and 1.96%, respectively. The share of non-sensitive examples from synthetic generation using templates, automated labeling via semi-supervised techniques, and manual annotations is 15.35%, 83.75%, and 0.90%, respectively. The neural network models (Bi-LSTM and CNN) also have a lower memory footprint (22.5% lower than the baseline and 80% lower than Logistic Regression) while giving improved accuracy.
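A small sketch contrasting the two approaches discussed, a keyword rule versus a character n-gram logistic regression; the rule set and training sentences are invented placeholders, not the paper's 2.1M-example dataset:

```python
# Illustrative contrast: a brittle keyword rule vs. a simple ML classifier
# over character n-grams, which may generalize to keyword variants the
# regular expression misses. Toy data throughout.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

KEYWORDS = re.compile(r"\b(insult|slur)\b", re.I)   # toy rule set

def rule_based(utterance: str) -> bool:
    return bool(KEYWORDS.search(utterance))         # brittle to variations

train_x = ["that was a nasty insult", "play some music",
           "what a vile slur", "set a timer for five minutes"]
train_y = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression())
clf.fit(train_x, train_y)

# "insulting" defeats the \b-anchored rule; the n-gram model may catch it
print(rule_based("that was insulting"), clf.predict(["that was insulting"])[0])
```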
- Book Chapter
2
- 10.1002/9781118648759.ch6
- May 13, 2013
This chapter provides a comprehensive survey of the generation of vehicular mobility traces for network simulation. It introduces the process of generating synthetic mobility traces and outlines the history of synthetic vehicular mobility generation. The chapter provides a taxonomy of the most important tools employed in this process, that is, the microscopic-level simulators available today for generating vehicular mobility traces. Finally, the outcome of the process, that is, the mobility traces publicly available for the simulation of vehicular networks, is presented.
- Conference Article
- 10.1109/simsym.1993.639084
- Jan 1, 1993
The objective of this paper is to develop a tool to study the interaction between the cache coherence protocol and the media access protocol for a multiprocessor system in a wavelength-division-multiplexed star-coupled configuration. The simulation is modular for easy swapping of component functional specifications, and inter-module communication is achieved through a message-passing facility. Writing simulations in this way makes it faster and more reliable to create simulations for comparison purposes. The tool is applied to the simulation of an Optically Interconnected Distributed Shared Memory system, which employs a photonic network using Wavelength Division Multiple Access to create multiple channels on a single optical fiber as an interconnection network in support of memory accesses. The paper considers two variants of a directory-based cache coherence protocol. A synthetic address trace generation model, which is independent of the system architecture and offers great flexibility in characterizing various workloads, is used as an input to the system.
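A toy generator in the spirit described, where workload character is controlled by a few architecture-independent parameters; the parameter names and locality model are assumptions:

```python
# Illustrative synthetic address-trace generator: read/write mix and
# spatial locality are tunable knobs, independent of the simulated
# memory system that will consume the trace.
import random

def address_trace(length=20, n_blocks=1 << 16, locality=0.9,
                  read_ratio=0.7, seed=3):
    rng = random.Random(seed)
    addr = rng.randrange(n_blocks)
    for _ in range(length):
        if rng.random() < locality:
            addr = (addr + rng.choice([-1, 0, 1])) % n_blocks  # nearby reuse
        else:
            addr = rng.randrange(n_blocks)                     # random jump
        op = "R" if rng.random() < read_ratio else "W"
        yield op, addr

for op, addr in address_trace():
    print(op, hex(addr * 64))   # assuming 64-byte cache blocks
```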
- Research Article
- 10.2139/ssrn.4643250
- Jan 1, 2023
- SSRN Electronic Journal
Synthetic and Privacy-Preserving Traffic Trace Generation using Generative AI Models for Training Network Intrusion Detection Systems
- Conference Article
- 10.1109/csce60160.2023.00260
- Jul 24, 2023
Traffic jams impact commuters every day: people waste time sitting in traffic, and the associated emissions are also a factor to consider. It is known that making roads bigger and better does not solve congestion but rather worsens it. One of the solutions we can still apply is creating better policies for road use, but testing them in the real world can be expensive and dangerous. Thus, we need better traffic simulation strategies that accurately depict reality. We propose traffic prediction by feeding real-world data to machine learning models, and our results show that accurate prediction is possible with a high degree of confidence. Among the algorithms we evaluated, XGBoost displayed the best performance while fitting in an acceptable amount of time.
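A minimal sketch of such a prediction setup with XGBoost, assuming tabular features like hour-of-day and day-of-week with traffic volume as the target; the synthetic rush-hour data stands in for the real-world measurements used in the paper:

```python
# Illustrative traffic-volume regression: random rush-hour-shaped data,
# an XGBoost regressor, and a simple train/test split.
import numpy as np
from xgboost import XGBRegressor  # pip install xgboost

rng = np.random.default_rng(1)
n = 5000
hour = rng.integers(0, 24, n)
weekday = rng.integers(0, 7, n)
# synthetic target: morning and evening rush-hour peaks plus noise
volume = (200 + 150 * np.exp(-((hour - 8) ** 2) / 8)
              + 180 * np.exp(-((hour - 17) ** 2) / 8)
              + rng.normal(0, 20, n))

X = np.column_stack([hour, weekday])
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:4000], volume[:4000])
pred = model.predict(X[4000:])
print("MAE:", np.abs(pred - volume[4000:]).mean())
```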
- Conference Article
- 10.3997/2214-4609-pdb.144.p14
- Jan 1, 2003
The starting point for seismic interpretation and detailed seismic reservoir characterization is typically a good-quality well-to-seismic tie. A standard approach is to convolve an assumed wavelet with the reflectivity series determined from the acoustic impedance log. The resulting well-to-seismic tie is in many instances poor. This could be due to seismic processing problems and/or inappropriate calibration of the log data and the subsequent synthetic seismic trace generation.
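The standard construction the abstract refers to, reflectivity derived from the acoustic impedance log convolved with a wavelet; a Ricker wavelet is used here for illustration, whereas the real workflow extracts or assumes a wavelet from the seismic data:

```python
# Textbook synthetic seismic trace: reflection coefficients from acoustic
# impedance contrasts, convolved with a (here: Ricker) wavelet.
import numpy as np

def ricker(f=25.0, dt=0.002, length=0.128):
    """Ricker wavelet of peak frequency f (Hz) sampled at dt (s)."""
    t = np.arange(-length / 2, length / 2, dt)
    a = (np.pi * f * t) ** 2
    return (1 - 2 * a) * np.exp(-a)

def synthetic_trace(impedance, wavelet):
    # reflection coefficients from impedance contrasts at layer boundaries
    rc = (impedance[1:] - impedance[:-1]) / (impedance[1:] + impedance[:-1])
    return np.convolve(rc, wavelet, mode="same")

# toy three-layer acoustic impedance log (velocity x density)
imp = np.concatenate([np.full(100, 2.0e6),
                      np.full(100, 3.5e6),
                      np.full(100, 2.8e6)])
trace = synthetic_trace(imp, ricker())
print(trace.shape)
```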