Abstract

Dynamic analysis and pattern matching techniques are widely used in industry, and they provide a straightforward method for the identification of malware samples. Yara is a pattern matching technique that can use sandbox memory dumps for the identification of malware families. However, pattern matching techniques fail silently due to minor code variations, leading to unidentified malware samples. This paper presents a two-layered Malware Variant Identification using Incremental Clustering (MVIIC) process and proposes clustering of unidentified malware samples to enable the identification of malware variants and new malware families. The novel incremental clustering algorithm is used in the identification of new malware variants from the unidentified malware samples. This research shows that clustering can provide a higher level of performance than Yara rules, and that clustering is resistant to small changes introduced by malware variants. This paper proposes a hybrid approach, using Yara scanning to eliminate known malware, followed by clustering, acting in concert, to allow the identification of new malware variants. F1 score and V-Measure clustering metrics are used to evaluate our results.

Highlights

  • This paper presents a technique called Malware Variant Identification using Incremental Clustering (MVIIC)

  • While the clustering of dynamic analysis features has previously been used for malware detection [4,5], this paper proposes a hybrid scheme that combines Yara rules with a novel incremental clustering algorithm [6] to enable the identification of new malware families and malware variants

  • This paper proposes a two-layered MVIIC technique that makes use of a novel incremental clustering algorithm to support Yara-based malware family identification by the clustering of unidentified malware variants

Summary

Introduction

This paper presents a technique called Malware Variant Identification using Incremental Clustering (MVIIC). A sandbox is an instrumented virtual machine that executes malware samples and other programs, and gathers the dynamic analysis features that result from program execution. These features include filesystem activity, registry activity, network traffic, program execution, and code injection. Yara rules may fail to identify new malware variants when ongoing development of the malware modifies the code that the Yara regular expressions match, or when program strings are changed. These unidentified malware samples provide a valuable source of unknown malware families and new malware variants.
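
The two-layer idea described above can be illustrated with a minimal sketch. It is not the paper's implementation. The first layer uses plain byte-level regular expressions as a stand-in for Yara rules, and the second layer uses a simple leader-style threshold clustering with Jaccard similarity in place of the novel incremental clustering algorithm [6], whose details are not given in this section. The family names, patterns, feature strings, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
import re

# Layer 1: stand-in for Yara scanning (assumption: one regex per known family).
KNOWN_FAMILY_PATTERNS = {
    "family_a": re.compile(rb"evil_mutex_v1"),
    "family_b": re.compile(rb"POST /gate\.php"),
}


def scan_known(memory_dump: bytes):
    """Return the first matching family name, or None if no pattern matches."""
    for family, pattern in KNOWN_FAMILY_PATTERNS.items():
        if pattern.search(memory_dump):
            return family
    return None


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class IncrementalClusterer:
    """Leader-style clustering: a sample joins the most similar existing
    cluster if similarity reaches the threshold, else it starts a new one.
    This is an illustrative substitute, not the algorithm from the paper."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.leaders = []  # one representative feature set per cluster

    def add(self, features: frozenset) -> int:
        best_id, best_sim = -1, 0.0
        for cluster_id, leader in enumerate(self.leaders):
            sim = jaccard(features, leader)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        if best_id >= 0 and best_sim >= self.threshold:
            return best_id
        self.leaders.append(features)
        return len(self.leaders) - 1


def identify(samples, clusterer):
    """Layer 1 labels known families; layer 2 clusters whatever is left."""
    results = {}
    for name, dump, features in samples:
        family = scan_known(dump)
        if family is not None:
            results[name] = ("known", family)
        else:
            results[name] = ("cluster", clusterer.add(frozenset(features)))
    return results


# Hypothetical samples: (name, sandbox memory dump, dynamic analysis features).
SAMPLES = [
    ("s1", b"xx evil_mutex_v1 xx", {"reg:Run", "file:a.tmp"}),
    ("s2", b"xx evil_mutex_v2 xx", {"reg:Run", "file:a.tmp", "net:10.0.0.1"}),
    ("s3", b"xx evil_mutex_v3 xx", {"reg:Run", "file:a.tmp", "net:10.0.0.2"}),
    ("s4", b"unrelated bytes", {"inject:explorer.exe", "file:z.dll"}),
]

if __name__ == "__main__":
    for name, label in identify(SAMPLES, IncrementalClusterer(0.5)).items():
        print(name, label)
```

In this toy data, a changed string lets samples s2 and s3 slip past the pattern layer, yet their overlapping behavioural features place them in the same cluster, while the unrelated sample s4 opens a new one. A real pipeline would replace `scan_known` with an actual Yara scan and would need a clustering method suited to the feature representation in use.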

