Abstract

Matching malware variants within the same malware family has long been a significant challenge for Cyber Threat Intelligence (CTI). For zero-day malware that does not belong to an existing family, timely matching of its variants is essential for effective threat tracing and prompt incident response. However, malware variants take diverse forms, which makes them difficult to match. Additionally, the information extracted from a given malware sample is often inaccurate, especially for zero-day malware. Existing solutions focus only on detecting known malware or on determining whether two samples are similar, without creating any reusable representation of the samples. In this paper, we propose the first practical and efficient solution for zero-day malware variant matching with reconstruction. By combining multi-modality learning with a Siamese-based structure, our model can navigate across different modalities and match zero-day variants. To address missing or noisy modalities, we propose a Conditional Variational Autoencoder with a Generative Adversarial Network for heightened resolution. We trained the model on 100,000 malware triplet pairs. Our experiments on real-world noisy samples show that the model outperforms the state of the art and can accurately match not only zero-day malware but also out-of-sample benign binaries of the same category.
