Data Augmentation in Training Deep Learning Models for Malware Family Classification

Ding Yuxin, Ma Yubin, Wang Guangbin, Ding Haoxuan

doi:10.1109/icmlc54886.2021.9737271

Abstract

With the rapid development of deep learning technologies, various deep learning models have been applied to detect and classify malware. When deep learning models are used to classify malware families, a major bottleneck is the shortage of labeled family samples needed to train them. Deep models applied to malware require a large number of training samples. To address this issue, we propose a method for generating malware family samples. We use the Grad-CAM algorithm to locate the raw data that represent malware features, and we create new samples by inserting these data into section gaps and newly added sections of PE files. Experimental results show that adding the generated samples to the training dataset improves the classification accuracy of deep learning models.
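
The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 1D convolutional model over raw bytes whose layer activations and gradients are already available as arrays, and it covers only the section-gap case, not the creation of new sections. The function names (grad_cam_1d, top_window, fill_gap) and the bytes_per_step parameter are hypothetical; in practice the gap offsets would be derived from the PE section table rather than passed in by hand.

```python
import numpy as np

def grad_cam_1d(activations, gradients):
    """Standard Grad-CAM for one 1D conv layer.

    Both inputs have shape (channels, length); gradients are the
    derivatives of the target family score w.r.t. the activations.
    """
    weights = gradients.mean(axis=1)              # one weight per channel
    return np.maximum(weights @ activations, 0)   # ReLU of the weighted sum

def top_window(cam, bytes_per_step, window_steps):
    """Byte range (start, end) of the highest-scoring window.

    bytes_per_step (assumed) maps one CAM position to a span of input bytes.
    """
    sums = np.convolve(cam, np.ones(window_steps), mode="valid")
    start = int(np.argmax(sums))
    return start * bytes_per_step, (start + window_steps) * bytes_per_step

def fill_gap(sample, gap_start, gap_end, payload):
    """Overwrite the gap [gap_start, gap_end) with payload bytes.

    The payload is truncated to fit, so the file size is unchanged.
    """
    out = bytearray(sample)
    n = min(len(payload), gap_end - gap_start)
    out[gap_start:gap_start + n] = payload[:n]
    return bytes(out)

# Toy data standing in for a real model and real PE files.
activations = np.array([[0.0, 1.0, 0.0, 0.0],
                        [0.0, 0.0, 2.0, 0.0]])
gradients = np.array([[1.0, 1.0, 1.0, 1.0],
                      [0.5, 0.5, 0.5, 0.5]])
source = bytes(range(8))                          # sample the features come from
target = b"\xaa" * 4 + b"\x00" * 6 + b"\xbb" * 2  # zero run plays the section gap

cam = grad_cam_1d(activations, gradients)
lo, hi = top_window(cam, bytes_per_step=2, window_steps=2)
augmented = fill_gap(target, 4, 10, source[lo:hi])
```

Writing only into padding leaves the section contents and the file length of the target untouched, which is the property that makes gaps a natural place for inserted data.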

Full Text