An Adaptive Behavioral-Based Incremental Batch Learning Malware Variants Detection Model Using Concept Drift Detection and Sequential Deep Learning

Abdulbasit A Darem, Afrah Y Al-Rezami, Jemal H Abawajy, Fuad A Ghaleb, Sultan M Alanazi, Asma A Al-Hashmi

doi:10.1109/access.2021.3093366

Abstract

Malware variants are a major emerging cybersecurity threat because of the damage they can cause to computer systems. Many solutions have been proposed for detecting them. However, accurate detection is challenging because malware variants evolve constantly, which causes concept drift. Existing malware detection solutions assume that the mapping learned from historical malware features will remain valid for new and future malware. That is, the relationship between input features and the class label is treated as stationary, an assumption that does not hold for ever-evolving malware variants. Malware features change dynamically through code obfuscation, mutation, and modifications that malware authors make to shift the feature distribution and evade detection, rendering the detection model obsolete and ineffective. This study presents an Adaptive behavioral-based Incremental Batch Learning Malware Variants Detection model using concept drift detection and sequential deep learning (AIBL-MVD) to accommodate new malware variants. Malware behaviors were extracted using dynamic analysis by running the malware files in a sandbox environment and collecting their Application Programming Interface (API) traces. The malware samples were sorted by their first-time appearance to capture how the variants' characteristics change over time. The base classifier was then trained on a subset of historical malware samples using a sequential deep learning model. To address the catastrophic forgetting dilemma of incremental learning, new malware samples were mixed with a subset of old data and introduced to the learning model gradually, in incremental batches of adaptive size. A statistical process control technique was used to detect concept drift, which signals when the model should be incrementally updated and thereby reduces the frequency of model updates.
Results from extensive experiments show that the proposed model outperforms the static model, periodic retraining approaches, and the fixed batch size incremental learning approach in terms of detection rate and efficiency. The model maintains an average detection accuracy of 99.41% on new malware and malware variants, with a low update frequency of 1.35 times per month.
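
The two mechanisms named in the abstract, drift detection by statistical process control and mixing new samples with a subset of old data, can be illustrated with a minimal sketch. The abstract does not specify the exact control rule or mixing ratio, so everything below is an assumption for illustration: the detector follows the well-known DDM-style rule (monitor the running error rate p and its standard deviation s, and flag drift when p + s exceeds the recorded minimum by three standard deviations), and the names `SpcDriftDetector`, `mix_with_replay`, `min_samples`, `warn_k`, `drift_k`, and `replay_ratio` are hypothetical.

```python
import math
import random


class SpcDriftDetector:
    """DDM-style control chart over a stream of 0/1 prediction errors.

    Assumption: the paper's actual SPC rule may differ from this one.
    """

    def __init__(self, min_samples=30, warn_k=2.0, drift_k=3.0):
        self.min_samples = min_samples  # samples to observe before testing
        self.warn_k = warn_k            # warning limit, in standard deviations
        self.drift_k = drift_k          # drift limit, in standard deviations
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Consume one error flag (1 = misclassified) and return the state."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its binomial std deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            # Remember the best (lowest) operating point seen so far.
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_k * self.s_min:
            self.reset()  # start a fresh chart once the model is updated
            return "drift"
        if p + s > self.p_min + self.warn_k * self.s_min:
            return "warning"
        return "stable"


def mix_with_replay(new_samples, old_samples, replay_ratio=0.5, seed=0):
    """Return the new samples plus a random subset of old ones, shuffled.

    Replaying old data alongside new data is one common way to limit
    catastrophic forgetting; the ratio used here is an arbitrary assumption.
    """
    rng = random.Random(seed)
    old = list(old_samples)
    k = min(len(old), int(len(new_samples) * replay_ratio))
    batch = list(new_samples) + rng.sample(old, k)
    rng.shuffle(batch)
    return batch


# Usage: a 5% error rate for 200 samples, then every prediction is wrong.
detector = SpcDriftDetector()
stream = [1 if i % 20 == 0 else 0 for i in range(1, 201)] + [1] * 50
states = [detector.update(e) for e in stream]
first_drift = states.index("drift")  # position where an update would trigger
update_batch = mix_with_replay(list(range(10)), list(range(100, 200)))
```

In this sketch the model would be updated only when `update` returns "drift", which is how a control chart can lower the update frequency compared with retraining on a fixed schedule.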

Highlights

Malware threats have increased dramatically due to the growing number of internet users, the proliferation of malware creation tools, and the use of obfuscation techniques by malware authors [1, 2]
The Application Programming Interface (API) call sequences, extracted through dynamic analysis and enriched by the n-gram model, were passed to the feature extraction method to reduce the time and resource complexity of the detection model
In this study, the concept drift problem caused by malware variants is addressed by presenting an adaptive batch incremental deep learning model that improves the accuracy of malware variant detection
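
As an illustration of the n-gram enrichment mentioned in the highlights, the sketch below counts contiguous n-grams of API call names in a single trace. It is not the authors' feature pipeline: the function name `api_ngrams`, the choice n = 2, and the example trace are assumptions made for illustration only.

```python
from collections import Counter


def api_ngrams(trace, n=2):
    """Count contiguous n-grams of API call names in one execution trace.

    Returns an empty Counter when the trace is shorter than n.
    """
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


# Hypothetical trace, as a sandbox might record it for one sample.
trace = ["NtCreateFile", "NtWriteFile", "NtClose", "NtCreateFile", "NtWriteFile"]
bigrams = api_ngrams(trace, n=2)
```

Each distinct n-gram can then serve as one feature, with its count as the value, so the ordering of neighboring calls is preserved in a fixed-size representation.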

Summary

Introduction

Malware threats have increased dramatically due to the growing number of internet users, the proliferation of malware creation tools, and the use of obfuscation techniques by malware authors [1, 2]. Malware is a general term that refers to malicious software or any unwanted software. Malware comes in many types, such as viruses, worms, botnets, backdoors, trojan horses, ransomware, and rootkits, among many other families [3]. Each type of malware attacks and functions differently. The consequences of malware vary according to the type of malware, the type of infected target, and the purpose of the attack [4, 5].

Methods

Results

Conclusion