Abstract

Applying machine learning techniques to malware detection is a common approach for overcoming the limitations of signature-based methods. However, it is difficult to engineer a set of features that characterizes the samples properly, especially when many file types can serve as infection vectors. In this work, we build several feature sets for dynamic malware detection from API calls, including an alternative scheme that groups calls into categories, network activity, signatures from the Cuckoo sandbox report, and some interactions with the file system and registry. We test combinations of these feature sets to ascertain whether they are good enough to distinguish between benign and malicious samples in a dataset containing several file types, obtained from public sources. We apply statistical inference to measure the differences in performance between the feature sets and between the hyperparameter optimization algorithms used to construct the models. We also unbalance the datasets to evaluate model performance in more realistic scenarios in which few malware samples are available. Although all studied feature configurations provide accuracies greater than 0.98, and several of them a Matthews correlation coefficient greater than 0.95 on the unbalanced datasets, statistically significant differences appear, so we analyze the results to determine the optimal feature set. We obtain a model that achieves an accuracy of 0.9937 on the balanced dataset and a Matthews correlation coefficient of 0.964 on the unbalanced dataset with 5% malware.

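As a minimal illustration of the evaluation setting described above, the Python sketch below computes accuracy and the Matthews correlation coefficient on an unbalanced split with roughly 5% positives. The synthetic data and the random-forest settings are placeholder assumptions for demonstration only, not the paper's actual features, dataset, or model configuration.

# Minimal sketch (not the paper's pipeline): evaluate a classifier with
# accuracy and the Matthews correlation coefficient on an unbalanced split.
# The synthetic data and the RandomForest settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic dataset with about 5% positives, mimicking a scenario in which
# few malware samples are available.
X, y = make_classification(n_samples=4000, n_features=40,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# On unbalanced data, accuracy can look high even for weak models, whereas
# MCC accounts for all four cells of the confusion matrix.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("MCC:", matthews_corrcoef(y_test, y_pred))

This is why MCC is the more informative figure of merit in the 5%-malware scenario: a classifier that simply labels everything benign would still reach 0.95 accuracy but an MCC of 0.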