Combat Mobile Evasive Malware via Skip-Gram-Based Malware Detection

Alper Egitmen, Omer Seyrekbasan, A Gokhan Yavuz, A Bilge Gunduz, Irfan Bulut, R Can Aygun

doi:10.1155/2020/6726147

Open Access

Abstract

Android malware detection is an important research topic in the security area. There are a variety of existing malware detection models based on static and dynamic malware analysis. However, most of these models are not very successful when it comes to evasive malware detection. In this study, we aimed to create a malware detection model based on a natural language model called skip-gram to detect evasive malware with the highest accuracy rate possible. In order to train and test our proposed model, we used an up-to-date malware dataset called Argus Android Malware Dataset (AMD) since the AMD contains various evasive malware families and detailed information about them. Meanwhile, for the benign samples, we used Comodo Android Benign Dataset. Our proposed model starts with extracting skip-gram-based features from instruction sequences of Android applications. Then it applies several machine learning algorithms to classify samples as benign or malware. We tested our proposed model with two different scenarios. In the first scenario, the random forest-based classifier performed with 95.64% detection accuracy on the entire dataset and 95% detection accuracy against evasive only samples. In the second scenario, we created a test dataset that contained zero-day malware samples only. For the training set, we did not use any sample that belongs to the malware families in the test set. The random forest-based model performed with 37.36% accuracy rate against zero-day malware. In addition, we compared our proposed model’s malware detection performance against several commercial antimalware applications using VirusTotal API. Our model outperformed 7 out of 10 antimalware applications and tied with one of them on the same test scenario.
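
The abstract describes extracting skip-gram-based features from instruction sequences. As a minimal sketch of the underlying idea only, and not the authors' pipeline, the following shows how a skip-gram model pairs each instruction with its neighbours inside a context window. The opcode names, the window size, and the helper names `skipgram_pairs` and `pair_frequency_features` are assumptions made for illustration; the pair-frequency vector is a simplified stand-in for the learned embeddings a full skip-gram model would produce.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def skipgram_pairs(opcodes: List[str], window: int = 2) -> List[Pair]:
    """Return (target, context) pairs for each opcode and its neighbours
    that lie at most `window` positions away in the sequence."""
    pairs: List[Pair] = []
    for i, target in enumerate(opcodes):
        lo = max(0, i - window)
        hi = min(len(opcodes), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, opcodes[j]))
    return pairs


def pair_frequency_features(
    opcodes: List[str], vocabulary: List[Pair], window: int = 2
) -> List[float]:
    """Hypothetical feature vector: relative frequency of each vocabulary
    pair among all skip-gram pairs of one application."""
    counts = Counter(skipgram_pairs(opcodes, window))
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[pair] / total for pair in vocabulary]


# Made-up instruction sequence for one application.
sequence = ["const", "invoke", "move", "return"]
print(skipgram_pairs(sequence, window=1))
```

A vector of this kind, one per application, is the sort of input that a classifier such as a random forest could then label as benign or malware.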

Highlights

Advancements in mobile device technology led developers to make rich content applications for different purposes such as social media, health care, finance, and government.
For both of the models, the random forest (RF) algorithm gave the best precision with 97% and 97.4%, respectively. The 1999 instances from the left-out malware families were tested for malwareness using the RF-based model from the second scenario.
We compared our model’s performance to the performances of top 11 reliable antimalware software applications. As these antimalware software applications use accuracy as the performance metric, we compared our accuracy value to their corresponding values. The results of the comparison are given in Table 6, which shows that our model outperformed seven out of 11 commercial software applications and tied with one.
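
The second scenario keeps every sample of the test families out of training, so the tested malware is unseen at the family level. A minimal sketch of such a family-wise hold-out split is given below. The sample records, the family names, and the helper `family_holdout_split` are hypothetical and only illustrate the constraint; they are not the authors' data or code.

```python
from typing import Dict, List, Optional, Set, Tuple

# Each sample carries a name and a malware family; benign apps use None.
Sample = Dict[str, Optional[str]]


def family_holdout_split(
    samples: List[Sample], held_out: Set[str]
) -> Tuple[List[Sample], List[Sample]]:
    """Put every sample of a held-out family in the test set and all
    remaining samples (benign and other families) in the training set."""
    train: List[Sample] = []
    test: List[Sample] = []
    for sample in samples:
        if sample["family"] in held_out:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


# Made-up samples for illustration only.
samples = [
    {"name": "app1", "family": None},
    {"name": "app2", "family": "FamilyA"},
    {"name": "app3", "family": "FamilyB"},
    {"name": "app4", "family": "FamilyA"},
    {"name": "app5", "family": None},
]

train, test = family_holdout_split(samples, {"FamilyB"})

# No malware family may appear on both sides of the split.
train_families = {s["family"] for s in train if s["family"] is not None}
test_families = {s["family"] for s in test}
assert train_families.isdisjoint(test_families)
print(len(train), len(test))
```

Splitting by family rather than by individual sample prevents near-duplicate variants of one family from leaking into both sets, which would otherwise overstate performance on unseen malware.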

Summary

Introduction

Advancements in mobile device technology led developers to make rich content applications for different purposes such as social media, health care, finance, and government. As mobile device usage increased drastically, malicious software (malware) developers turned their attention to mobile application markets [1,2,3,4]. Malware may have many different goals, such as encrypting personal data, using device resources (e.g., for cryptocurrency mining), stealing sensitive information (financial information, pictures, contacts, etc.), converting the victim’s machine into a bot, and restricting access to critical services [5]. According to the Statista report in [6], Android OS has had the highest market share worldwide on mobile devices since 2011. As of May 2017, Android had over two billion monthly active users, and as of December 2018 the Google Play store featured over 2.6 million apps [6].

Objectives

Methods

Results

Conclusion