Abstract

A key challenge in malware lineage inference is identifying the versions of collected malware, which requires highly accurate clustering. In this article, we tackle this problem and present a novel mechanism that uses behavioral features for version identification of (un)packed malware. Our basic idea is to focus on intrafamily clustering. We extract so-called family feature sets, i.e., hybrid features specific to each family. Our intuition is that family feature sets may achieve higher clustering accuracy than common feature sets, and that unpacked malware found in or relevant to such a cluster can then be used to infer the lineage of family members with traditional inference methods. We conduct experiments with two datasets, 8928 malware samples from VXHeavens and 3293 samples by manual analysis, consisting largely of packed malware. The results demonstrate that we can accurately classify samples into malware families based on the hybrid features we choose. In addition, we can effectively extract family feature sets from 37 feature categories using forward stepwise selection. For intrafamily clustering, we employed the agglomerative clustering algorithm and observed that using family feature sets is significantly more accurate than using common feature sets, which facilitates higher-accuracy lineage inference of packed malware.

Highlights

  • There is substantial growth in the amount of malware emerging annually

  • We propose a new method of version identification so that we can create compatible inputs for lineage inference from large-scale malware datasets

  • In the current malware environment, which consists mostly of packed malware samples, our approach plays a crucial role in version identification for large-scale lineage inference

Summary

INTRODUCTION

There is substantial growth in the amount of malware emerging annually. According to AV-TEST, the number of malware samples reported in 2008 was approximately 10 million, which increased to 127 million in 2015, a more than 12-fold increase [4]. Most samples are packed, which means that the size of N can be dramatically reduced. In this context, version identification is a crucial step for filtering packed malware before performing lineage inference. Clustering groups packed and unpacked malware of the same version according to behavioral features that can be extracted through dynamic analysis. 1) New version identification system: We propose an integrated system that includes feature processing, family classification, and intrafamily clustering for malware version identification. Feature sets can improve the accuracy of intrafamily clustering, i.e., version identification. Intrafamily clustering based on family feature sets results in an F1-score of about 90%, a considerable increase over prior version identification studies, e.g., approximately 70%.
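The combination of forward stepwise selection and agglomerative clustering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy samples, the category names (`api`, `registry`, `mutex`), the Jaccard distance, average linkage, the 0.5 merge threshold, and the pairwise F1 score are all assumptions chosen only to keep the example small and dependency-free.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def project(sample, categories):
    """Keep only the features that belong to the chosen categories."""
    return frozenset((c, f) for c in categories for f in sample.get(c, ()))

def agglomerative(points, threshold):
    """Average-linkage agglomerative clustering (assumed linkage).
    Merging stops once the closest clusters are farther than `threshold`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(x, y):
        total = sum(jaccard_distance(points[i], points[j]) for i in x for j in y)
        return total / (len(x) * len(y))

    while len(clusters) > 1:
        d, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a, b in combinations(range(len(clusters)), 2))
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid

    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

def pairwise_f1(pred, truth):
    """F1 over sample pairs: a pair is positive if both share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_true = pred[i] == pred[j], truth[i] == truth[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def forward_stepwise(samples, truth, categories, threshold=0.5):
    """Greedily add the category that most improves clustering F1;
    stop when no remaining category improves the score."""
    selected, best, remaining = [], 0.0, list(categories)
    while remaining:
        scored = []
        for c in remaining:
            pts = [project(s, selected + [c]) for s in samples]
            scored.append((pairwise_f1(agglomerative(pts, threshold), truth), c))
        score, cat = max(scored)
        if score <= best:
            break
        selected.append(cat)
        remaining.remove(cat)
        best = score
    return selected, best

# Hypothetical family with two versions; only `registry` separates them.
samples = [
    {"api": {"CreateFile", "WriteFile"}, "registry": {"run_a"}, "mutex": {"m1"}},
    {"api": {"CreateFile", "Connect"},   "registry": {"run_a"}, "mutex": {"m1"}},
    {"api": {"CreateFile", "WriteFile"}, "registry": {"run_b"}, "mutex": {"m1"}},
    {"api": {"CreateFile", "Connect"},   "registry": {"run_b"}, "mutex": {"m1"}},
]
truth = [0, 0, 1, 1]  # ground-truth version labels

family_features, f1 = forward_stepwise(samples, truth, ["api", "registry", "mutex"])
print(family_features, f1)
```

On this toy data, the selection keeps only the category that separates the two versions and discards the categories shared across versions, which mirrors the intuition that a family-specific feature set can cluster versions more accurately than a common one.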

BACKGROUND
System Overview
Feature Processing
Version Identification
Overview
Feature Extraction
VERSION IDENTIFICATION
Family Classification
Agglomerative Clustering
Forward Stepwise Selection
Cluster Head Selection
Dataset
Packers
Intrafamily Clustering
Practical Impact
Limitations and Future Work
Malware Lineage Inference and Version Identification
Machine-Learning-Based Malware Analysis
Feature Engineering and Feature Selection
CONCLUSION