Abstract
Attackers often use executable files (malware) as tools to obtain sensitive information from targeted companies and individuals. Anti-virus software attempts to detect malware with techniques such as pattern matching, but these techniques have difficulty detecting unknown malware. Unknown malware can instead be detected with tools such as a sandbox; because running a sandbox is time-consuming, we consider another approach. ASCII strings extracted from executable files are helpful for analyzing malware, and recent advances in natural language processing (NLP) make it possible to use these strings for malware detection. In this paper, we propose a malware detection method that applies NLP techniques to ASCII strings. Our method divides the strings into words and distinguishes benign from malicious executable files by the differences in their words. Because NLP techniques compare the arrangement or frequency of words, uncommon words contribute little, so we expect that removing them improves the detection rate. Our method therefore converts a corpus of frequent words into a feature vector using NLP techniques. In our experiments, we used a dataset of more than 23,000 malware samples (more than 2,100 malware families) provided by FFRI and more than 16,000 benign files collected from download.cnet.com. Our method achieves an F-measure of more than 0.85. The experimental results show that our method detects unknown malware with high accuracy.
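The pipeline described in the abstract (extract ASCII strings, split them into words, keep only frequent words, build a feature vector) can be illustrated with a minimal sketch. The abstract does not specify an implementation, so everything below is an assumption made for illustration: the function names, the minimum string length of 4, the alphanumeric tokenization rule, the `top_k` cutoff, and the use of plain word counts as a stand-in for whichever NLP model actually produces the feature vector.

```python
import re
from collections import Counter


def extract_ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII bytes, similar to the Unix `strings` tool.

    min_len=4 is an assumed threshold, not a value taken from the paper.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def tokenize(strings: list[str]) -> list[str]:
    """Split strings into lowercase alphanumeric words (assumed tokenization)."""
    words = []
    for s in strings:
        words.extend(re.findall(r"[a-z0-9]+", s.lower()))
    return words


def build_vocabulary(documents: list[list[str]], top_k: int) -> list[str]:
    """Keep only the top_k most frequent words across all documents.

    This is the step that discards uncommon words.
    """
    counts = Counter()
    for words in documents:
        counts.update(words)
    return [word for word, _ in counts.most_common(top_k)]


def vectorize(words: list[str], vocabulary: list[str]) -> list[int]:
    """Bag-of-words count vector over the frequent-word vocabulary.

    A simple stand-in for the NLP-based feature vector in the abstract.
    """
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    # Hypothetical byte content standing in for two executable files.
    files = [
        b"\x00\x01GetProcAddress\x00ab\x00LoadLibraryA kernel32.dll\xff",
        b"\x7f\x00kernel32.dll\x00\x02CreateFileA\x00",
    ]
    documents = [tokenize(extract_ascii_strings(f)) for f in files]
    vocabulary = build_vocabulary(documents, top_k=3)
    for words in documents:
        print(vectorize(words, vocabulary))
```

The resulting vectors could then be passed to any standard classifier trained on labeled benign and malicious files. Which model and classifier the paper actually uses is not stated in the abstract.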