Birds of a Feature: Intrafamily Clustering for Version Identification of Packed Malware

Leo Hyun Park, Hong-Koo Kang, Jungbeen Yu, Taejin Lee, Taekyoung Kwon

doi:10.1109/jsyst.2019.2960076


Open Access


Abstract

In malware lineage inference, it is challenging to identify the versions of collected malware, because doing so requires highly accurate clustering. In this article, we tackle this problem and present a novel mechanism that uses behavioral features for version identification of (un)packed malware. Our basic idea is to focus on intrafamily clustering. We extract so-called family feature sets, i.e., hybrid features specific to each family. Our intuition is that family feature sets may achieve higher clustering accuracy than common feature sets, and that unpacked malware found in or relevant to such a cluster can enable lineage inference of family members using traditional inference methods. We conduct experiments with two datasets, 8928 malware samples from VXHeavens and 3293 samples obtained by manual analysis, a large portion of which are packed. The results demonstrate that we can accurately classify samples into malware families based on the hybrid features we choose. In addition, we can effectively extract family feature sets from 37 feature categories using forward stepwise selection. For intrafamily clustering, we employed the agglomerative clustering algorithm and observed that using family feature sets is significantly more accurate than using common feature sets, which facilitates higher-accuracy lineage inference of packed malware.
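The forward stepwise selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the category names, the `toy_score` function, and the `min_gain` stopping rule are all assumptions made for illustration. In practice the score would presumably be a clustering or classification accuracy measured on one family's samples.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def forward_stepwise_select(
    categories: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> Tuple[List[str], float]:
    """Greedily add the category that improves `score` most; stop when no gain."""
    remaining = sorted(set(categories))  # sorted for deterministic tie-breaking
    selected: List[str] = []
    best = float("-inf")
    while remaining:
        # Score every one-category extension of the current selection.
        scored = [(score(frozenset(selected + [c])), c) for c in remaining]
        top_score, top_cat = max(scored, key=lambda t: t[0])
        if top_score - best < min_gain:
            break  # no candidate improves the score enough
        selected.append(top_cat)
        remaining.remove(top_cat)
        best = top_score
    return selected, best


# Hypothetical per-category contributions (integers keep the arithmetic exact).
# These names are NOT taken from the paper's 37 feature categories.
TOY_WEIGHTS = {"api_calls": 5, "registry": 3, "network": 1,
               "file_size": -1, "timestamp": -2}


def toy_score(subset: FrozenSet[str]) -> float:
    """Stand-in for an accuracy measure evaluated on a feature subset."""
    return float(sum(TOY_WEIGHTS[c] for c in subset))


selected, best = forward_stepwise_select(TOY_WEIGHTS, toy_score)
print(selected, best)
```

With these toy weights the loop adds the three helpful categories in order of benefit and stops once every remaining candidate would lower the score.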

Highlights

There is a substantial growth in the amount of malware emerging annually
We propose a new method of version identification so that we can create compatible inputs for lineage inference from large-scale malware datasets
In the current malware environment that mostly consists of packed malware samples, our approach plays a crucial role in version identification associated with large-scale lineage inference

Summary

INTRODUCTION

There is a substantial growth in the amount of malware emerging annually. According to AV-TEST, the number of malware samples reported in 2008 was approximately 10 million, which increased to 127 million in 2015, indicating a roughly 12-fold increase [4]. Most samples are packed, which means that the size of N can be dramatically reduced. In this context, version identification is a crucial step for filtering packed malware before performing lineage inference. Clustering groups packed and unpacked malware of the same version according to behavioral features that can be extracted through dynamic analysis.

1) New version identification system: We propose an integrated system that includes feature processing, family classification, and intrafamily clustering for malware version identification. Family feature sets can improve the accuracy of intrafamily clustering, i.e., version identification. Intrafamily clustering based on family feature sets results in an F1-score of about 90%, a considerable increase over prior version identification studies, which reported approximately 70%.
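To make the clustering step and the F1-score concrete, here is a small sketch of average-linkage agglomerative clustering followed by a pairwise F1 evaluation. Several things here are assumptions rather than details from the paper: the linkage criterion, the Euclidean distance, the distance threshold used as a stopping rule, the toy two-dimensional feature vectors, and the pairwise definition of F1 (the summary does not say how the authors compute it).

```python
from itertools import combinations
from math import dist
from typing import List, Sequence


def agglomerative(points: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Average-linkage clustering; merging stops above `threshold`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Mean Euclidean distance over all cross-cluster pairs.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def pairwise_f1(true: Sequence, pred: Sequence) -> float:
    """F1 over sample pairs: a pair is positive if both share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true)), 2):
        same_true = true[i] == true[j]
        same_pred = pred[i] == pred[j]
        tp += same_true and same_pred
        fp += (not same_true) and same_pred
        fn += same_true and (not same_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy behavioral feature vectors for six samples of one hypothetical family.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
labels = agglomerative(points, threshold=3.0)
versions = ["v1", "v1", "v1", "v2", "v2", "v3"]  # hypothetical ground truth
print(labels, pairwise_f1(versions, labels))
```

A distance threshold is used instead of a fixed cluster count because the number of versions in a family is usually not known in advance. Whether the authors stop merging this way is not stated in the summary.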

BACKGROUND

System Overview

Feature Processing

Version Identification

Overview

Feature Extraction

VERSION IDENTIFICATION

Family Classification

Agglomerative Clustering

Forward Stepwise Selection

Cluster Head Selection

Dataset

Packers

Intrafamily Clustering

Practical Impact

Limitations and Future Work

Malware Lineage Inference and Version Identification

Machine-Learning-Based Malware Analysis

Feature Engineering and Feature Selection

CONCLUSION


Journal: IEEE Systems Journal
Publication Date: Sep 1, 2020
Citations: 40
License type: CC BY 4.0


