Synthetic flow-based cryptomining attack generation through Generative Adversarial Networks

Alberto Mozo,Ángel González-Prieto,Antonio Pastor,Sandra Gómez-Canaval,Edgar Talavera

doi:10.1038/s41598-022-06057-2

Abstract

Due to the growing rise of cyber attacks in the Internet, the demand of accurate intrusion detection systems (IDS) to prevent these vulnerabilities is increasing. To this aim, Machine Learning (ML) components have been proposed as an efficient and effective solution. However, its applicability scope is limited by two important issues: (i) the shortage of network traffic data datasets for attack analysis, and (ii) the data privacy constraints of the data to be used. To overcome these problems, Generative Adversarial Networks (GANs) have been proposed for synthetic flow-based network traffic generation. However, due to the ill-convergence of the GAN training, none of the existing solutions can generate high-quality fully synthetic data that can totally substitute real data in the training of ML components. In contrast, they mix real with synthetic data, which acts only as data augmentation components, leading to privacy breaches as real data is used. In sharp contrast, in this work we propose a novel and deterministic way to measure the quality of the synthetic data produced by a GAN both with respect to the real data and to its performance when used for ML tasks. As a by-product, we present a heuristic that uses these metrics for selecting the best performing generator during GAN training, leading to a novel stopping criterion, which can be applied even when different types of synthetic data are to be used in the same ML task. We demonstrate the adequacy of our proposal by generating synthetic cryptomining attacks and normal traffic flow-based data using an enhanced version of a Wasserstein GAN. The results evidence that the generated synthetic network traffic can completely replace real data when training a ML-based cryptomining detector, obtaining similar performance and avoiding privacy violations, since real data is not used in the training of the ML-based detector.

Highlights

We propose a Wasserstein GAN (WGAN) architecture to generate synthetic flow-based network traffic that can fully replace real traffic with two complementary goals: (1) avoiding privacy breaches when sharing data with third parties or deploying data augmentation solutions and (2) obtaining a nearly unlimited source of synthetic data that is similar to the real data from a statistical perspective and can be utilised to fully substitute real data in Machine Learning (ML) training processes while keeping the same performance as ML models trained with real data
Each type of traffic was modeled with two different WGAN architectures and using a rich set of hyperparameters we run an extensive number of WGAN trainings
Several enhancements were proposed to improve the simple WGAN performance: a custom activation function to better adapt the generator output to the data domain, and several heuristics to apply during Generative Adversarial Networks (GANs) training to adapt the learning speeds of the generator and the discriminator

Summary

Objectives

In contrast to most of the proposed works that are based on data augmentation solutions, we aim to generate synthetic data that can fully replace real data. Regarding that companies are reluctant to share their data with third party developers and governments concerned about privacy issues are imposing limitations to telecom providers for accessing user data or inspecting packet payload, the aim of this work is to construct a generative model able to substitute these data belonging to real clients by fully synthetic data with no information of the original clients. In these experiments we aim to prove that synthetically generated data can be utilised for totally substituting real data in ML training processes while keeping the same performance than ML models trained with real. We aim to apply some type of elitism that prevents a drastic fall of the performance due to a random choice of a bad generator

Methods

Results

Conclusion