Software products from all vendors have vulnerabilities that can cause a security concern. Malware is used as a prime exploitation tool to exploit these vulnerabilities. Machine learning (ML) methods are efficient in detecting malware and are state-of-art. The effectiveness of ML models can be augmented by reducing false negatives and false positives. In this paper, the performance of bagging and boosting machine learning models is enhanced by reducing misclassification. Shapley values of features are a true representation of the amount of contribution of features and help detect top features for any prediction by the ML model. Shapley values are transformed to probability scale to correlate with a prediction value of ML model and to detect top features for any prediction by a trained ML model. The trend of top features derived from false negative and false positive predictions by a trained ML model can be used for making inductive rules. In this work, the best performing ML model in bagging and boosting is determined by the accuracy and confusion matrix on three malware datasets from three different periods. The best performing ML model is used to make effective inductive rules using waterfall plots based on the probability scale of features. This work helps improve cyber security scenarios by effective detection of false-negative zero-day malware.
Read full abstract