Ensemble Malware Classification System Using Deep Neural Networks

Barath Narayanan Narayanan, Venkata Salini Priyamvada Davuluru

doi:10.3390/electronics9050721

Open Access

Journal: Electronics
Publication Date: Apr 27, 2020
Citations: 32
License type: CC BY 4.0

Affiliation: University of Dayton

Abstract

With the advancement of technology, there is a growing need to classify malware programs that could harm computer systems and smaller devices. In this research, an ensemble classification system comprising convolutional and recurrent neural networks is proposed to distinguish malware programs. Microsoft’s Malware Classification Challenge (BIG 2015) dataset, which has nine distinct classes, is used for this study. The dataset contains an assembly file and a compiled file for each malware program. Compiled files are visualized as images and classified using Convolutional Neural Networks (CNNs). Assembly files consist of machine language opcodes, which are converted into sequences and classified using Long Short-Term Memory (LSTM) networks. In addition, features are extracted from these architectures (CNNs and LSTM) and classified using a support vector machine or logistic regression. The LSTM network achieves an accuracy of 97.2% on assembly files, the CNN architecture achieves 99.4% on compiled files, and the proposed ensemble approach achieves an overall accuracy of 99.8%, setting a new benchmark. Because assembly and compiled files can each be classified by an independent, automated system, anti-malware industry experts can choose the type of system that suits their available computational resources.
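The two preprocessing steps mentioned in the abstract (viewing a compiled file as an image, and turning opcodes into a sequence) can be illustrated with a minimal sketch. This is not the authors' pipeline: the row width, the vocabulary, the padding and unknown-opcode ids, and the function names `bytes_to_image` and `opcodes_to_sequence` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def bytes_to_image(data: bytes, width: int = 16) -> List[List[int]]:
    """Reshape raw bytes into rows of `width` grayscale pixels (0-255).

    The width is an illustrative assumption, not a value from the paper.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded_len = -(-len(data) // width) * width  # round up to a full row
    padded = data + bytes(padded_len - len(data))  # zero-pad the last row
    return [list(padded[i:i + width]) for i in range(0, padded_len, width)]


def opcodes_to_sequence(
    opcodes: Sequence[str], vocab: Dict[str, int], max_len: int = 8
) -> List[int]:
    """Map opcode mnemonics to integer ids, then truncate or zero-pad.

    Assumed convention: id 0 is padding, id 1 is any opcode not in `vocab`.
    """
    ids = [vocab.get(op, 1) for op in opcodes][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical inputs: five bytes of a compiled file and four opcodes.
image = bytes_to_image(bytes([0x4D, 0x5A, 0x90, 0x00, 0x03]), width=4)
vocab = {"mov": 2, "push": 3, "call": 4}
sequence = opcodes_to_sequence(["push", "mov", "call", "xor"], vocab, max_len=6)
```

In a setup like this, the pixel rows would be fed to an image classifier and the fixed-length id sequences to a recurrent model; the padding keeps every sample the same shape.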

Highlights

Classifying malware programs into different categories based on their patterns has been a research area attracting great interest for several years [1].
The Long Short-Term Memory (LSTM) network achieves an accuracy of 97.2% on assembly files, the Convolutional Neural Network (CNN) architecture achieves 99.4% on compiled files, and the proposed ensemble approach achieves an overall accuracy of 99.8%, setting a new benchmark.
We present a novel approach that uses CNNs and LSTMs as feature extractors rather than as classifiers, to combat the class imbalance and the limited number of training images in the dataset.
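This page does not state how the ensemble combines the image-based and sequence-based models. One common option, shown here purely as an assumption, is late fusion: take a weighted average of the per-class probabilities from the two models and pick the largest. The function names, the weight, and the three-class example (the dataset itself has nine classes) are hypothetical.

```python
from typing import List, Sequence


def fuse_probabilities(
    cnn_probs: Sequence[float],
    lstm_probs: Sequence[float],
    cnn_weight: float = 0.5,
) -> List[float]:
    """Weighted average of two per-class probability vectors.

    Assumed fusion rule for illustration; not taken from the paper.
    """
    if len(cnn_probs) != len(lstm_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= cnn_weight <= 1.0:
        raise ValueError("cnn_weight must lie in [0, 1]")
    return [
        cnn_weight * c + (1.0 - cnn_weight) * l
        for c, l in zip(cnn_probs, lstm_probs)
    ]


def predict_class(probs: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for three classes from each model.
fused = fuse_probabilities([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
predicted = predict_class(fused)
```

With equal weights, a confident prediction from one model can override a weaker, conflicting prediction from the other, which is the usual motivation for combining two views of the same sample.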

Summary

Introduction

Classifying malware programs into different categories based on their patterns has been a research area attracting great interest for several years [1]. On a computer or any other electronic device, such as a mobile phone or laptop, a malware program can be present as an assembly file, a binary file, or both. Anti-malware vendors apply remedial measures once a given malware program has been associated with a particular category. Identifying that category can be difficult and time-consuming because of polymorphism and large file sizes. Attackers use polymorphism to present malware programs in different forms and sizes, which makes it hard for the anti-malware industry to classify or identify such files.
