Commercial anti-virus software traditionally memorizes specific byte sequences (known as “signatures”) in the file contents of previously encountered malware. However, malware authors can evade signature-based detection in many ways: for instance, by using obfuscation techniques such as “packing” (encryption or compression) to hide snippets of malicious code, by writing metamorphic malware, or by tampering with existing malware. We hypothesize that certain evasion techniques can leave traces in the file’s entropy signal, revealing either similarities to known malware or the presence of tampering per se. To this end, we present SUSPEND (SUSPicious ENtropy signal Detector), an expert system which evaluates the suspiciousness of an executable file’s entropy signal in order to support malware classification. Whereas entropy analysis has traditionally been used for packer detection (so that entropy-based features often comprise merely the mean entropy or the entropy of a few file subcomponents), SUSPEND applies non-stationary time series modeling to aid in malware detection. In particular, SUSPEND (a) quantifies the “amount of structure” in the entropy signal (through detrended fluctuation analysis), (b) finds the location and size of sudden jumps in entropy (through mean change point modeling), and (c) computes the distribution of entropic variation across multiple spatial scales (through wavelet decomposition). In addition, SUSPEND (d) summarizes the entropy signal’s empirical probability distribution. Because SUSPEND’s run time can be made to scale linearly in file size, it is well-suited for large-scale malware analysis. We apply SUSPEND to a large-scale malware detection task with 500,000 heterogeneous real-world samples and over 1 million features. We find that SUSPEND boosts the predictive performance of traditional entropy analysis (as found in packer detectors) from 77.02% to 96.62%.
Moreover, SUSPEND’s focus on entropy signals makes it a natural candidate for combining with other types of features; for instance, combining SUSPEND with a strings-based feature set boosts predictive accuracy from 97.18% to 98.62%. Thus, whereas entropy analysis has traditionally focused on detecting that a file is packed, SUSPEND’s more comprehensive representation of the entropy signal helps to determine that a file is malicious. We illustrate the application of SUSPEND by studying 18 samples of VirRansom, a family of viral ransomware which could cost large organizations millions. SUSPEND detects 100% of the studied files with over 99% confidence, whereas a more traditional strings-based model was very close to undecided and represented the entire family with a single string.
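
To make the notion of an entropy signal concrete, the following is a minimal Python sketch, not the SUSPEND implementation. It computes the Shannon entropy of consecutive fixed-size chunks of a file and then summarizes how much variation the signal carries at each scale. The function names, the 256-byte chunk size, and the choice of the Haar wavelet are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte chunk, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())


def entropy_signal(data: bytes, chunk_size: int = 256) -> list:
    """Entropy of each consecutive chunk (chunk_size is an assumed value)."""
    return [chunk_entropy(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def haar_energy_by_scale(signal) -> list:
    """Sum of squared Haar detail coefficients per scale, finest scale first.

    The signal is truncated to the largest power-of-two length. Haar is an
    assumed wavelet, chosen because it needs no external library.
    """
    n = 1 << (len(signal).bit_length() - 1) if signal else 0
    approx = list(signal[:n])
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies


# Synthetic "file": a low-entropy region followed by a high-entropy region,
# loosely mimicking plain code followed by a packed or encrypted payload.
data = b"\x00" * 512 + bytes(range(256)) * 2
signal = entropy_signal(data)
energies = haar_energy_by_scale(signal)
```

In this synthetic example the single jump in entropy sits at the midpoint of the signal, so all of the wavelet energy lands at the coarsest scale. Each pass over the signal halves its length, so the total work stays linear in the number of chunks, in keeping with the linear scaling the abstract describes.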