Abstract

With the rapid advance of anti-virus and anti-tracking technologies, three aspects of malware clustering need to be improved for clustering to remain effective: the robustness of features, the accuracy of similarity measurements, and the effectiveness of clustering algorithms. In this paper, we propose a novel malware family clustering approach based on weighted dynamic and static features. The approach employs a new similarity measurement method based on the Earth Mover's Distance (EMD) to improve the accuracy of feature similarities. In addition, to reduce convergence time and improve clustering purity, we design a novel semi-supervised clustering algorithm, termed S-DBSCAN, by incorporating supervision information into the original Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The experimental results demonstrate that the proposed approach accurately distinguishes samples of different families and achieves a superior purity of 98.7%.

Highlights

  • An increasingly important problem of malware analysis is the large number of new malware samples

  • New malware variants are emerging at a rate that far exceeds the capability of manual analysis

  • Similarity measurement: We propose a new similarity measurement based on EMD for malware similarity calculation


Summary

INTRODUCTION

An increasingly important problem of malware analysis is the large number of new malware samples. Pitolli et al. [15] used the BIRCH clustering algorithm to cluster samples on dynamic and static features extracted by the Cuckoo Sandbox [48], and evaluated the ground truths of the sample labels by two different approaches. The clustering algorithms currently applied to malware clustering are mostly unsupervised learning algorithms, such as hierarchical clustering algorithms and density clustering algorithms. The EMD is a distance between two distributions, calculated by considering each feature together with its weight. As a result, it is well suited to malware family clustering.
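To make the idea of a distance over weighted features concrete, the sketch below computes the EMD for the simplest case, where each feature is a point on a numeric axis and carries a weight. This is only an illustration under stated assumptions: the function name emd_1d, the (value, weight) signature layout, and the one-dimensional ground distance are hypothetical and are not taken from the paper, whose ground distance and feature encoding are not given in this summary.

```python
def emd_1d(sig_a, sig_b):
    """Earth Mover's Distance between two weighted 1-D signatures.

    Each signature is a list of (value, weight) pairs. The weights of each
    signature are normalised to sum to 1, and the ground distance between
    two features is the absolute difference of their values (an assumption
    made for this sketch, not the paper's definition).
    """
    total_a = sum(w for _, w in sig_a)
    total_b = sum(w for _, w in sig_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each signature needs a positive total weight")

    # Signed mass at every value: positive where A has surplus, negative for B.
    delta = {}
    for value, weight in sig_a:
        delta[value] = delta.get(value, 0.0) + weight / total_a
    for value, weight in sig_b:
        delta[value] = delta.get(value, 0.0) - weight / total_b

    # Sweep left to right. The surplus carried across each gap must be
    # moved over that gap, so it contributes |carried| * gap_length of work.
    points = sorted(delta)
    work = 0.0
    carried = 0.0
    for left, right in zip(points, points[1:]):
        carried += delta[left]
        work += abs(carried) * (right - left)
    return work


if __name__ == "__main__":
    # Hypothetical signatures: (feature value, feature weight).
    sample_a = [(0.0, 0.5), (2.0, 0.5)]
    sample_b = [(1.0, 1.0)]
    print(emd_1d(sample_a, sample_b))
```

In one dimension the optimal transport plan never needs to be solved explicitly, which keeps the sketch short. A general ground distance between features would instead require solving a transportation problem, for example as a linear program.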

EVALUATION INDICATOR
Findings
CONCLUSION
