Abstract
A key challenge for malware lineage inference is identifying the versions of collected malware with high clustering accuracy. In this article, we tackle this problem and present a novel mechanism that uses behavioral features for version identification of (un)packed malware. Our basic idea is to focus on intrafamily clustering. We extract so-called family feature sets, i.e., hybrid features specific to each family. Our intuition is that family feature sets may achieve higher clustering accuracy than common feature sets, and that unpacked malware found in or related to such a cluster enables lineage inference of family members using traditional inference methods. We conduct experiments with two datasets, 8928 malware samples from VXHeavens and 3293 samples from manual analysis, both consisting largely of packed malware. The results demonstrate that we can accurately classify samples into malware families based on the hybrid features we choose. In addition, we can effectively extract family feature sets from 37 feature categories using forward stepwise selection. For intrafamily clustering, we employ the agglomerative clustering algorithm and observe that using family feature sets is significantly more accurate than using common feature sets, which facilitates higher-accuracy lineage inference of packed malware.
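To make the feature selection step concrete, the following is a minimal sketch of generic forward stepwise selection over feature categories. It is not the authors' implementation: the function name `forward_stepwise_select`, the scoring function `toy_score`, and the four category names are hypothetical stand-ins, and in practice the score would presumably be a clustering or classification accuracy measured on one family's samples.

```python
from typing import Callable, List, Sequence


def forward_stepwise_select(
    categories: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedily add the category that raises the score most; stop when none does."""
    selected: List[str] = []
    remaining = list(categories)
    best_score = float("-inf")  # the first category is always accepted
    while remaining:
        # Score every candidate extension of the current selection.
        trial_scores = {c: score(selected + [c]) for c in remaining}
        best_candidate = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_candidate] - best_score <= min_gain:
            break  # no remaining category improves the score
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_score = trial_scores[best_candidate]
    return selected


def toy_score(subset: List[str]) -> float:
    """Hypothetical score: usefulness of each category minus a per-category cost."""
    usefulness = {"api_calls": 0.5, "registry": 0.3, "network": 0.05, "mutex": 0.0}
    return sum(usefulness[c] for c in subset) - 0.1 * len(subset)


# Hypothetical category names; the paper's 37 categories are not listed here.
family_feature_set = forward_stepwise_select(
    ["api_calls", "registry", "network", "mutex"], toy_score
)
print(family_feature_set)
```

With this toy score, the selection stops after two categories because adding a third lowers the score. Running the same procedure once per family would yield one feature set per family, which matches the idea of family-specific feature sets described above.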
Highlights
There is substantial growth in the amount of malware emerging annually
We propose a new method of version identification so that we can create compatible inputs for lineage inference from large-scale malware datasets
In the current malware environment that mostly consists of packed malware samples, our approach plays a crucial role in version identification associated with large-scale lineage inference
Summary
There is substantial growth in the amount of malware emerging annually. According to AV-TEST, the number of malware samples reported in 2008 was approximately 10 million, which increased to 127 million in 2015, roughly a 12-fold increase [4]. Most samples are packed, which means that the size of N can be dramatically reduced. In this context, version identification is a crucial step for filtering packed malware before performing lineage inference. Clustering groups packed and unpacked malware of the same version according to behavioral features that can be extracted through dynamic analysis. 1) New version identification system: We propose an integrated system that includes feature processing, family classification, and intrafamily clustering for malware version identification. Family feature sets can improve the accuracy of intrafamily clustering, i.e., version identification. Intrafamily clustering based on family feature sets results in an F1-score of about 90%, a considerable increase over prior version identification studies, e.g., approximately 70%.
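The summary reports clustering quality as an F1-score but does not state how it is computed. One common convention, assumed here purely for illustration, is the pairwise F1: every pair of samples counts as a true positive when both the ground truth and the clustering place the two samples together. The function name `pairwise_f1` and the labels below are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence


def pairwise_f1(true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]) -> float:
    """Pairwise F1 between a ground-truth version labeling and a clustering."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair correctly placed in the same cluster
        elif same_pred:
            fp += 1  # pair clustered together but from different versions
        elif same_true:
            fn += 1  # pair from the same version but split apart
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: five samples, two true versions, one sample misplaced.
true_versions = ["v1", "v1", "v1", "v2", "v2"]
cluster_ids = [0, 0, 1, 1, 1]
print(pairwise_f1(true_versions, cluster_ids))  # 0.5
```

In this example two pairs are correct, two are wrongly merged, and two are wrongly split, so precision and recall are both 0.5. A clustering that reproduces the true versions exactly scores 1.0.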