Detecting unknown malicious code by applying classification techniques on OpCode patterns

Asaf Shabtai, Shlomi Dolev, Robert Moskovitch, Clint Feher, Yuval Elovici

doi:10.1186/2190-8532-1-1

Open Access

https://doi.org/10.1186/2190-8532-1-1

Journal: Security Informatics
Publication Date: Feb 27, 2012
Citations: 268
License type: CC BY 2.0

Affiliation: Ben-Gurion University of the Negev

Abstract

In previous studies, classification algorithms were employed successfully for the detection of unknown malicious code. Most of these studies extracted features based on byte n-gram patterns in order to represent the inspected files. In this study we represent the inspected files using OpCode n-gram patterns, which are extracted from the files after disassembly. The OpCode n-gram patterns are used as features for the classification process. The main goal of the classification process is to detect unknown malware within a set of suspected files, which will later be included in antivirus software as signatures. A rigorous evaluation was performed using a test collection comprising more than 30,000 files, in which various settings of OpCode n-gram patterns of various size representations and eight types of classifiers were evaluated. A typical problem of this domain is the imbalance problem, in which the distribution of the classes in real life varies. We investigated the imbalance problem, referring to several real-life scenarios in which malicious files are expected to be about 10% of the total inspected files. Lastly, we present a chronological evaluation in which the frequent need for updating the training set was evaluated. Evaluation results indicate that the evaluated methodology achieves a level of accuracy higher than 96% (with TPR above 0.95 and FPR approximately 0.1), which slightly improves on the results of previous studies that use byte n-gram representation. The chronological evaluation showed a clear trend in which the performance improves as the training set becomes more up to date.
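The OpCode n-gram representation described in the abstract can be illustrated with a minimal sketch. It assumes the file has already been disassembled into a list of opcode mnemonics. The function names, the sample sequence, and the choice to normalize each count by the count of the most frequent n-gram in the file are illustrative assumptions, not details confirmed by the abstract.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return every contiguous n-gram of an opcode sequence as a tuple."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequency(opcodes, n):
    """Map each n-gram to its count divided by the largest count in the file.

    Assumption: this max-normalized TF is one common definition; the
    abstract does not state which normalization the authors used.
    """
    counts = Counter(opcode_ngrams(opcodes, n))
    if not counts:
        return {}
    peak = max(counts.values())
    return {gram: count / peak for gram, count in counts.items()}


# Hypothetical opcode sequence, standing in for a disassembler's output.
sample = ["push", "mov", "sub", "mov", "push", "mov"]
tf = term_frequency(sample, 2)
```

Here the 2-gram `("push", "mov")` occurs twice and every other 2-gram once, so it receives the highest weight. Each file becomes a sparse vector of such weights, which is the kind of input a classifier would consume.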

Highlights

Modern computer and communication infrastructures are highly susceptible to various types of attacks
In accordance with these questions, we wanted to identify the best settings of the classification framework, which is determined by a combination of: (1) the term representation (TF or TF inverse document frequency (TFIDF)); (2) the OpCode n-gram size (1, 2, 3, 4, 5 or 6); (3) the top-selection of features (50, 100, 200 or 300); (4) the feature selection method (DF, Fisher Score (FS) or Gain Ratio (GR)); and (5) the classifier (SVM, Logistic Regression (LR), Random Forest (RF), Artificial Neural Networks (ANN), Decision Trees (DT), Boosted Decision Trees (BDT), Naïve Bayes (NB) or Boosted Naïve Bayes (BNB))
The mean True Positive Rate (TPR), False Positive Rate (FPR), accuracy and G-Means of the term frequency (TF) and the TFIDF representations were nearly identical, which is good because maintaining the TFIDF requires additional computational effort each time malcode or benign files are added to the collection
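The four measures named in the highlights can be computed from confusion-matrix counts. The sketch below uses the standard definitions, with G-Means taken as the geometric mean of TPR and the true negative rate; it is assumed, not confirmed by this page, that the authors use the same definitions. The function name and the example counts are hypothetical.

```python
import math


def detection_metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, accuracy and G-Means from confusion-matrix counts.

    tp/fn count malicious files detected/missed; tn/fp count benign files
    passed/flagged. Assumes each class has at least one file.
    """
    tpr = tp / (tp + fn)                     # share of malicious files detected
    fpr = fp / (fp + tn)                     # share of benign files flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    g_mean = math.sqrt(tpr * (1.0 - fpr))    # sqrt(TPR * true negative rate)
    return {"tpr": tpr, "fpr": fpr, "accuracy": accuracy, "g_mean": g_mean}


# Hypothetical counts: 100 malicious and 900 benign files (10% malicious).
metrics = detection_metrics(tp=90, fp=18, tn=882, fn=10)
```

G-Means is useful under class imbalance because it drops sharply when either class is handled poorly, whereas accuracy can stay high simply because the majority class dominates.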

Summary

Introduction

Modern computer and communication infrastructures are highly susceptible to various types of attacks. Anti-virus vendors face thousands of suspicious files every day [2]. These files are collected from various sources, including dedicated honeypots, third-party providers and files reported by customers either automatically or explicitly. Our main goal in this study is to filter out unknown malicious files from the files arriving at an anti-virus vendor every day. The authors of [16] were the first to introduce the idea of applying Machine Learning (ML) methods to detect different types of malware based on their respective binary codes. A new executable file was compared with the profiles of malicious and benign classes, and was assigned to the most similar

Similar Papers

A Chronological Evaluation of Unknown Malcode Detection
Robert Moskovitch ... Clint Feher
01 Jan 2009

Unknown malcode detection — A chronological evaluation
Robert Moskovitch ... Yuval Elovici
01 Jun 2008

Unknown Malcode Detection Using OPCODE Representation
Robert Moskovitch ... Nir Tzachar
01 Jan 2008

Unknown malcode detection and the imbalance problem
Robert Moskovitch ... Nathalie Japkowicz
Journal in Computer Virology | VOL. 5
11 Jul 2009
