Abstract

In spear-phishing attacks, macro malware written in VBA (Visual Basic for Applications) is often used to compromise target computers. Such malware is frequently obfuscated in several ways to evade detection. To detect new macro malware, several methods based on machine learning techniques have been proposed. However, many of these methods were evaluated on inadequate datasets or on balanced datasets with the same number of benign and malicious samples, so their practical performance is still open to discussion. In reality, the population of VBA macros consists of a wide variety of samples. To evaluate practical performance, an imbalanced dataset that contains many benign samples is required. In this paper, we propose an improved method of detecting macro malware on an imbalanced dataset. Our method uses two language models, Doc2vec and Latent Semantic Indexing (LSI), and four popular classifiers. The language models are used to extract features and to mitigate the class imbalance problem by selecting important features. We create an imbalanced dataset with more than 30,000 samples and evaluate practical performance on it. The experimental results demonstrate that our method mitigates the class imbalance problem and can detect completely new malware regardless of family type. The results also reveal that LSI is more robust than Doc2vec to the class imbalance problem.

Highlights

  • Spear-phishing attacks are one of the main threats to organizations of all sizes and in every field

  • While many studies focus on Portable Document Format (PDF) document files [2]–[8] or their JavaScript [9]–[11], this study focuses on Microsoft (MS) document files

  • This paper proposes an improved method of detecting macro malware on an imbalanced dataset


Summary

Introduction

Spear-phishing attacks are one of the main threats to organizations of all sizes and in every field. To detect new macro malware, several methods based on machine learning models have been proposed [14]–[19]. These methods were evaluated on balanced datasets with the same number of benign and malicious samples. To evaluate practical performance, an imbalanced dataset that contains many benign samples is required [24]. The experimental results demonstrate that our method mitigates the class imbalance problem and can detect new malware families.
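To illustrate why a balanced evaluation can overstate practical performance, the following minimal sketch computes the precision of a single hypothetical detector at several benign-to-malicious ratios. The true positive rate, false positive rate, and ratios are assumed values chosen for illustration only; they are not results from this paper.

```python
def precision(tpr: float, fpr: float, benign_per_malicious: float) -> float:
    """Precision when each malicious sample is accompanied by
    `benign_per_malicious` benign samples."""
    true_positives = tpr * 1.0                     # per malicious sample
    false_positives = fpr * benign_per_malicious   # benign samples flagged
    return true_positives / (true_positives + false_positives)


# Hypothetical detector characteristics (assumptions, not measured values).
TPR = 0.95  # fraction of malicious macros detected
FPR = 0.02  # fraction of benign macros wrongly flagged

for ratio in (1, 10, 100):
    p = precision(TPR, FPR, ratio)
    print(f"benign:malicious = {ratio}:1 -> precision = {p:.3f}")
# benign:malicious = 1:1 -> precision = 0.979
# benign:malicious = 10:1 -> precision = 0.826
# benign:malicious = 100:1 -> precision = 0.322
```

The detector itself is unchanged across the three rows; only the class ratio differs. A score obtained on a balanced dataset therefore says little about how many alerts would be false alarms when benign macros dominate.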

Results
Conclusion
