Dynamic Malware Classification and API Categorisation of Windows Portable Executable Files Using Machine Learning

Durre Zehra Syeda,Mamoona Naveed Asghar

doi:10.3390/app14031015

Durre Zehra Syeda, Mamoona Naveed Asghar

Open Access

https://doi.org/10.3390/app14031015

Copy DOI

Abstract

The rise of malware attacks presents a significant cyber-security challenge, with advanced techniques and offline command-and-control (C2) servers causing disruptions and financial losses. This paper proposes a methodology for dynamic malware analysis and classification using a malware Portable Executable (PE) file from the MalwareBazaar repository. It suggests effective strategies to mitigate the impact of evolving malware threats. For this purpose, a five-level approach for data management and experiments was utilised: (1) generation of a customised dataset by analysing a total of 582 malware and 438 goodware samples from Windows PE files; (2) feature extraction and feature scoring based on Chi2 and Gini importance; (3) empirical evaluation of six state-of-the-art baseline machine learning (ML) models, including Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), XGBoost (XGB), and K-Nearest Neighbour (KNN), with the curated dataset; (4) malware family classification using VirusTotal APIs; and, finally, (5) categorisation of 23 distinct APIs from 266 malware APIs. According to the results, Gini’s method takes a holistic view of feature scoring, considering a wider range of API activities. The RF achieved the highest precision of 0.99, accuracy of 0.96, area under the curve (AUC) of 0.98, and F1-score of 0.96, with a 0.93 true-positive rate (TPR) and 0.0098 false-positive rate (FPR), among all applied ML models. The results show that Trojans (27%) and ransomware (22%) are the most risky among 11 malware families. Windows-based APIs (22%), the file system (12%), and registry manipulation (8.2%) showcased their importance in detecting malicious activity in API categorisation. This paper considers a dual approach for feature reduction and scoring, resulting in an improved F1-score (2%), and the inclusion of AUC and specificity metrics distinguishes it from existing research (Section Comparative Analysis with Existing Approaches). The newly generated dataset is publicly available in the GitHub repository (Data Availability Statement) to facilitate aspirant researchers’ dynamic malware analysis.

Full Text