An efficient two-stage pipeline model with filtering algorithm for mislabeled malware detection

Pushkar Kishore, Swadhin Kumar Barisal, Durga Prasad Mohapatra, Rajib Mall

doi:10.1016/j.cose.2023.103499

Abstract

Most malware detectors rely on machine-learning approaches. However, new malware samples appear at a rapid pace, and labeling them is expensive. Consequently, companies cannot label every sample submitted for analysis. Instead, they label a subset of samples and treat the remainder as benign. As a result, some malicious samples are mislabeled as benign and become noisy inputs, and the bias introduced by this mislabeling diminishes the accuracy of the detection model. Obfuscation and adversarial attacks can make malware detection harder still. These challenges motivate us to propose a novel two-stage pipeline technique based on two distinct sample behavior specifiers. First, system call and application programming interface (API) call traces of samples are collected using dynamic analysis tools. Dynamic analysis is selected because it handles obfuscation considerably better than static analysis. Second, the API and system call traces are translated into a bag of n-grams or a sequence of anomaly scores. The struc2vec algorithm is then used to select the best n-gram features, and the sequence of anomaly scores is transformed into six meta-features. A two-stage pipeline is proposed to detect mislabeled malware and protect the host from black-box attacks. Each pipeline stage has a filtering algorithm for identifying the true label of any mislabeled malware. Experimental results on the VirusTotal and Ember datasets indicate that our proposed model achieves Matthews Correlation Coefficients (MCC) of 0.79 and 0.62, respectively, when classifying over 18,000 mislabeled malware samples. Our proposed model also outperforms existing models.
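
To make two of the terms above concrete, the following minimal sketch shows how a call trace might be turned into a bag of n-grams and how the MCC is computed from confusion-matrix counts. It is an illustration only, not the authors' implementation: the function names `bag_of_ngrams` and `mcc`, the example trace, and the choice of n = 2 are assumptions made here, and the struc2vec feature selection and anomaly-score steps are not shown.

```python
from collections import Counter
from math import sqrt


def bag_of_ngrams(trace, n=2):
    """Count contiguous n-grams in a call trace (a list of call names).

    Hypothetical helper for illustration; the paper's exact n-gram
    construction is not specified in the abstract.
    """
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up trace of Windows-style call names (assumed, not from the paper).
trace = ["NtOpenFile", "NtReadFile", "NtWriteFile",
         "NtReadFile", "NtWriteFile", "NtClose"]
bag = bag_of_ngrams(trace, n=2)

# Made-up confusion-matrix counts, used only to exercise the formula.
score = mcc(tp=90, tn=85, fp=15, fn=10)
```

The MCC ranges from -1 to 1 and takes all four confusion-matrix cells into account, which makes it a reasonable choice when benign and malicious classes are imbalanced.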

Full Text