Abstract

In previous studies, classification algorithms were employed successfully for the detection of unknown malicious code. Most of these studies extracted features based on byte n-gram patterns in order to represent the inspected files. In this study, we represent the inspected files using OpCode n-gram patterns, which are extracted from the files after disassembly. The OpCode n-gram patterns are used as features for the classification process. The main goal of the classification process is to detect unknown malware within a set of suspected files, which will later be included in antivirus software as signatures. A rigorous evaluation was performed using a test collection comprising more than 30,000 files, in which various settings of OpCode n-gram patterns, of various sizes and representations, and eight types of classifiers were evaluated. A typical problem of this domain is the imbalance problem, in which the distribution of the classes in real life varies. We investigated the imbalance problem, referring to several real-life scenarios in which malicious files are expected to be about 10% of the total inspected files. Lastly, we present a chronological evaluation, which assessed how frequently the training set needs to be updated. Evaluation results indicate that the evaluated methodology achieves a level of accuracy higher than 96% (with TPR above 0.95 and FPR approximately 0.1), which slightly improves on the results of previous studies that use a byte n-gram representation. The chronological evaluation showed a clear trend in which performance improves as the training set becomes more up to date.
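To make the OpCode n-gram representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a file has already been disassembled into a list of OpCode mnemonics; the function names, the sample sequence, and the choice to normalize counts by the total number of n-grams are all assumptions made for illustration.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the contiguous OpCode n-grams of a sequence, as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequencies(opcodes, n):
    """Map each n-gram to its count divided by the total number of n-grams.

    Assumption: this simple relative frequency stands in for the TF
    weighting; the study may normalize differently.
    """
    grams = opcode_ngrams(opcodes, n)
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


# Hypothetical OpCode sequence, as a disassembler might emit for one file.
sample = ["push", "mov", "push", "mov", "call"]
tf = term_frequencies(sample, 2)
```

Each file then becomes a vector over the selected n-grams, which is the form a classifier expects as input.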

Highlights

  • Modern computer and communication infrastructures are highly susceptible to various types of attacks

  • In accordance with these questions, we wanted to identify the best settings of the classification framework, which is determined by a combination of: (1) the term representation (TF or TF inverse document frequency (TFIDF)); (2) the OpCode n-gram size (1, 2, 3, 4, 5 or 6); (3) the top-selection of features (50, 100, 200 or 300); (4) the feature selection method (DF, Fisher Score (FS) or Gain Ratio (GR)); and (5) the classifier (SVM, Logistic Regression (LR), Random Forest (RF), Artificial Neural Networks (ANN), Decision Trees (DT), BDT, Naïve Bayes (NB) or BNB)

  • The mean True Positive Rate (TPR), False Positive Rate (FPR), accuracy and G-Means of the term frequency (TF) and the TFIDF representations were nearly identical, which is good because maintaining the TFIDF requires additional computational effort each time malcode or benign files are added to the collection
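The maintenance cost of TFIDF mentioned in the last highlight can be illustrated with a small sketch. It assumes the common weighting TF × log(N / DF), where N is the number of files in the collection and DF is the number of files containing the term; the exact formula used in the study may differ, and the function names and toy collection are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tfidf(doc, docs):
    """Weight each term of doc by TF * log(N / DF) over the collection.

    Assumption: doc is a member of docs, so every term has DF >= 1.
    """
    df = document_frequency(docs)
    n_docs = len(docs)
    total = len(doc)
    return {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in Counter(doc).items()
    }


# Toy collection: each document is a list of OpCode 2-grams written as strings.
docs = [["push mov", "mov call"], ["push mov", "pop ret"]]
before = tfidf(docs[0], docs)

# Adding one file changes N and DF, so the weights of the first,
# untouched file must be recomputed.
docs_after = docs + [["mov call", "pop ret"]]
after = tfidf(docs[0], docs_after)
```

The TF part of each weight depends only on the file itself, whereas the IDF part depends on the whole collection. This is why a TF representation stays valid when new malcode or benign files arrive, while TFIDF weights have to be refreshed.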


Summary

Introduction

Modern computer and communication infrastructures are highly susceptible to various types of attacks. Anti-virus vendors face huge quantities (thousands) of suspicious files every day [2]. These files are collected from various sources, including dedicated honeypots, third-party providers and files reported by customers either automatically or explicitly. Our main goal in this study is to be able to filter out unknown malicious files from the files arriving at an anti-virus vendor every day. The authors of [16] were the first to introduce the idea of applying Machine Learning (ML) methods for the detection of different malware based on their respective binary codes. A new executable file was compared with the profiles of malicious and benign classes, and was assigned to the most similar
