Easy Data Augmentation for Improved Malware Detection: A Comparative Study

Jangseong Bae, Changki Lee

doi:10.1109/bigcomp51126.2021.00048

Abstract

Artificial data generation is important for improving research outcomes when using deep learning. As one of the most popular and promising generative models, the variational auto-encoder (VAE) can generate synthetic data that helps train classifiers more accurately. Artificial data can also be generated via easy data augmentation (EDA) techniques. EDA is a simple method used to boost performance on text classification tasks, and unlike generative models such as VAE, it does not require model training. Malware detection is the task of determining whether malicious software is present on a host system and diagnosing the type of attack. Without an adequate amount of training data, the efficiency of detecting malicious programs decreases. In this study, EDA was applied to malware detection and compared with VAE as an artificial data generation method. Using both methods, artificial training data for malware detection were generated and used to boost a long short-term memory recurrent neural network (LSTM RNN) based malware detection classifier. Experimental results show that when synthetic malware samples generated by EDA were added to the training data, the accuracy of the LSTM RNN classifier improved by 1.76%, compared with a 0.98% improvement with VAE. In addition, EDA generated malware training data 10 times faster than VAE, without requiring a separate training process. Further, we conducted extensive ablation studies and suggest parameters for practical use.
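
To illustrate why EDA needs no model training, the following is a minimal sketch of two of the standard EDA operations, random swap and random deletion, applied to a token sequence. It is not the authors' implementation. The abstract does not state how malware samples are represented, so the token list (hypothetical API call names), the function names, and the parameters `alpha`, `n_aug`, and `seed` are all assumptions made for illustration. The other two standard EDA operations, synonym replacement and random insertion, are omitted because they need a synonym resource that the abstract does not describe.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with two random positions swapped, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(tokens, alpha=0.1, n_aug=4, seed=0):
    """Generate n_aug synthetic sequences by alternating the two operations.

    alpha (assumed parameter) sets both the fraction of positions swapped
    and the per-token deletion probability.
    """
    rng = random.Random(seed)
    n_swaps = max(1, int(alpha * len(tokens)))
    samples = []
    for k in range(n_aug):
        if k % 2 == 0:
            samples.append(random_swap(tokens, n_swaps, rng))
        else:
            samples.append(random_deletion(tokens, alpha, rng))
    return samples


if __name__ == "__main__":
    # Hypothetical API-call trace; the real input format is not given in the abstract.
    trace = ["CreateFile", "WriteFile", "RegSetValue", "CreateProcess", "CloseHandle"]
    for sample in augment(trace):
        print(sample)
```

Each synthetic sequence would keep the label of the sample it was derived from and be appended to the training set. Because the operations are plain list manipulations, generation cost is linear in sequence length and involves no fitting step.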

Full Text