Abstract

Malware classification plays an important role in tracing the sources of attacks on computer systems. Existing static analysis methods classify quickly but are ineffective against malware that uses packing and obfuscation techniques; dynamic analysis methods handle packing and obfuscation better but incur excessive classification cost. To overcome these shortcomings, we propose Malscore, a classification system based on probability scoring and machine learning that uses a probability threshold to chain static analysis (Phase 1) and dynamic analysis (Phase 2). In Phase 1, convolutional neural networks (CNNs) with spatial pyramid pooling (SPP) analyze grayscale images (static features); in Phase 2, variable n-grams and machine learning analyze native API call sequences (dynamic features). By combining the two, Malscore speeds up static analysis by exploiting the strength of CNNs in image recognition and is more resilient to obfuscation thanks to dynamic analysis. Unlike other combined static and dynamic techniques, most malware is labeled by static analysis alone, so overhead is reduced by dynamically analyzing only the few samples whose static features are less distinctive or more ambiguous. We performed experiments on 174,607 malware samples from 63 malware families. Malscore achieved 98.82% accuracy for malware classification. Compared with a method that applies both static and dynamic analysis, Malscore reduced preprocessing time and test time by 59.58% and 61.70%, respectively.
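
The threshold-based hand-off between Phase 1 and Phase 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_with_threshold`, the default threshold of 0.9, and the `run_dynamic` callable standing in for the Phase 2 classifier are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def classify_with_threshold(
    static_probs: Sequence[float],
    run_dynamic: Callable[[], int],
    threshold: float = 0.9,  # assumed value; the source does not state one here
) -> Tuple[int, str]:
    """Return (family_index, phase_used) for one sample.

    static_probs: per-family probabilities from the static (Phase 1) classifier.
    run_dynamic:  hypothetical callback that runs the costly dynamic (Phase 2)
                  analysis and returns a family index. It is called only when
                  the static result is not confident enough.
    """
    if not static_probs:
        raise ValueError("static_probs must contain at least one probability")

    # Most likely family according to the static classifier.
    best = max(range(len(static_probs)), key=lambda i: static_probs[i])

    # Confident static result: accept it and skip dynamic analysis.
    if static_probs[best] >= threshold:
        return best, "phase1"

    # Otherwise fall back to dynamic analysis.
    return run_dynamic(), "phase2"
```

With these assumptions, a sample scored `[0.05, 0.92, 0.03]` is labeled in Phase 1, while a sample scored `[0.40, 0.35, 0.25]` is passed on to Phase 2. The dynamic classifier is invoked lazily, which is where the cost saving would come from.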

Highlights

  • The emergence of various automated tools has shown that the speed with which malware mutates on the Internet is far faster than people realized

  • Most malware can be classified by analyzing static features, but the proliferation of the packing and obfuscation techniques facilitates the creation of malware with consistent behavior and inconsistent static features

  • We propose a malware classification system Malscore based on probability scoring and machine learning


Summary

INTRODUCTION

The emergence of various automated tools has shown that malware mutates on the Internet far faster than people realized. We use probability scoring to filter out the majority of malware, which receives reliable classification results from classifier S, and pass only the samples with unreliable results to classifier D. This reduces the number of dynamic analysis runs and therefore the detection cost of Malscore. A CNN with an SPP layer analyzes grayscale images (static features) in Phase 1, and variable n-grams and machine learning analyze native API call sequences (dynamic features) in Phase 2.

V. ANALYSIS OF NATIVE API CALL SEQUENCES USING VARIABLE N-GRAMS AND MACHINE LEARNING

For the grayscale images generated in Section IV-A, there may be some samples of the same family whose static features are not very obvious.

3: Traverse APISequence; APIConcall ← each native API call or native API call subsequence that is called 4 or more times consecutively

11: Delete repeated native API calls in APIConcall within the family
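
The two surviving algorithm steps (collecting calls or subsequences repeated 4 or more times in a row, then removing duplicates) can be sketched as below. This is an illustrative sketch only: the intermediate steps of the original algorithm are not available here and are not reconstructed. The function name `continuous_repeats`, the `max_unit_len` bound on subsequence length, and the sample trace are assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple


def continuous_repeats(
    calls: Sequence[str],
    min_repeats: int = 4,   # "4 times or more continuously", per step 3
    max_unit_len: int = 3,  # assumed bound; the source does not specify one
) -> List[Tuple[str, ...]]:
    """Collect calls or short subsequences repeated back to back.

    Returns each repeated unit once, in order of first appearance, which
    loosely mirrors the duplicate-removal step (here within one trace only).
    """
    found: List[Tuple[str, ...]] = []
    seen = set()
    i, n = 0, len(calls)
    while i < n:
        advanced = False
        for unit_len in range(1, max_unit_len + 1):
            unit = tuple(calls[i:i + unit_len])
            if len(unit) < unit_len:
                break  # not enough calls left for a unit of this length
            # Count how many times the unit repeats consecutively from i.
            repeats, j = 1, i + unit_len
            while tuple(calls[j:j + unit_len]) == unit:
                repeats += 1
                j += unit_len
            if repeats >= min_repeats:
                if unit not in seen:
                    seen.add(unit)
                    found.append(unit)
                i = j  # skip past the whole repeated run
                advanced = True
                break
        if not advanced:
            i += 1
    return found


# Hypothetical trace for illustration only.
trace = (
    ["NtOpenFile"]
    + ["NtReadFile"] * 5
    + ["NtClose"]
    + ["NtAllocateVirtualMemory", "NtWriteVirtualMemory"] * 4
    + ["NtClose"]
)
features = continuous_repeats(trace)
# features == [("NtReadFile",),
#              ("NtAllocateVirtualMemory", "NtWriteVirtualMemory")]
```

Because the repeated units have different lengths, the resulting features behave like n-grams with variable n. Deduplicating across all samples of a family would need an extra pass over the per-sample results, which is omitted here.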
EXPERIMENTS AND RESULTS
EVALUATION OF N-GRAMS AND MACHINE LEARNING
LIMITATIONS
VIII. CONCLUSION