Fingerprinting Android malware families

Abstract

The dominance of the Android operating system in the smart-terminal market has brought growing threats from malicious applications (apps). Research on Android malware detection has received considerable attention in academia and industry. In particular, studies on malware families have been beneficial to malware detection and behavior analysis. However, identifying the characteristics of malware families and the features that can describe a particular family has been discussed less frequently in existing work. In this paper, we explore the key features that can classify and describe the behaviors of Android malware families, so that the families can be fingerprinted with these features. We present a framework for signature-based key feature construction. In addition, we propose a frequency-based feature elimination algorithm to select the key features. Finally, we construct the fingerprints of ten malware families, including twenty key features in three categories. Results of extensive experiments using a Support Vector Machine demonstrate that malware family classification achieves an accuracy of 92% to 99%. The typical behaviors of malware families are analyzed based on the selected key features. The results demonstrate the feasibility and effectiveness of the presented algorithm and fingerprinting method.
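
The abstract does not spell out the frequency-based feature elimination algorithm, so the following is only a minimal sketch of one plausible reading: a feature is kept for a family when it is frequent inside that family and rare in the others. The function name `select_key_features`, the thresholds `min_in` and `max_out`, and the toy family and permission names are all assumptions made for illustration, not details from the paper.

```python
from collections import Counter

def select_key_features(samples, family, min_in=0.8, max_out=0.2):
    """Keep features frequent inside `family` and rare in all other families.

    `samples` is a list of (family_name, set_of_features) pairs.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    inside = [feats for fam, feats in samples if fam == family]
    outside = [feats for fam, feats in samples if fam != family]
    in_counts = Counter(f for feats in inside for f in feats)
    out_counts = Counter(f for feats in outside for f in feats)
    selected = []
    for feature, count in in_counts.items():
        in_freq = count / len(inside)
        out_freq = out_counts[feature] / len(outside) if outside else 0.0
        # Eliminate features that are too rare in the family or too common elsewhere.
        if in_freq >= min_in and out_freq <= max_out:
            selected.append(feature)
    return sorted(selected)

# Toy data: hypothetical families described by requested permissions.
samples = [
    ("FamilyA", {"SEND_SMS", "READ_CONTACTS", "INTERNET"}),
    ("FamilyA", {"SEND_SMS", "INTERNET"}),
    ("FamilyB", {"INTERNET", "RECORD_AUDIO"}),
    ("FamilyB", {"INTERNET", "RECORD_AUDIO", "READ_CONTACTS"}),
]
print(select_key_features(samples, "FamilyA"))  # ['SEND_SMS']
```

In this toy data `INTERNET` is dropped because every family uses it, which is the intuition behind a fingerprint: the surviving features should separate one family from the rest. The selected features could then be fed to a classifier such as an SVM, as the abstract describes.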

Similar Papers
  • Research Article
  • Cite Count: 6
  • 10.1016/j.dib.2023.109750
Android malware detection with MH-100K: An innovative dataset for advanced research
  • Nov 2, 2023
  • Data in Brief
  • Hendrio Bragança + 5 more

High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide updated and public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It encompasses a main CSV file with valuable metadata, including the SHA256 hash (APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Moreover, the MH-100K dataset features an extensive collection of files containing useful metadata of the VirusTotal analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help extend existing malware taxonomies, identify novel variants, and explore malware evolution over time.

  • Research Article
  • Cite Count: 120
  • 10.1109/tdsc.2017.2739145
EC2: Ensemble Clustering and Classification for Predicting Android Malware Families
  • Oct 23, 2019
  • IEEE Transactions on Dependable and Secure Computing
  • Tanmoy Chakraborty + 2 more

As the most widely used mobile platform, Android is also the biggest target for mobile malware. Given the increasing number of Android malware variants, detecting malware families is crucial so that security analysts can identify situations where signatures of a known malware family can be adapted as opposed to manually inspecting behavior of all samples. We present EC2 (Ensemble Clustering and Classification), a novel algorithm for discovering Android malware families of varying sizes, ranging from very large to very small families (even if previously unseen). We present a performance comparison of several traditional classification and clustering algorithms for Android malware family identification on DREBIN, the largest public Android malware dataset with labeled families. We use the output of both supervised classifiers and unsupervised clustering to design EC2. Experimental results on both the DREBIN and the more recent Koodous malware datasets show that EC2 accurately detects both small and large families, outperforming several comparative baselines. Furthermore, we show how to automatically characterize and explain unique behaviors of specific malware families, such as FakeInstaller, MobileTx, and Geinimi. In short, EC2 presents an early warning system for emerging new malware families, as well as a robust predictor of the family (when it is not new) to which a new malware sample belongs, and the design of novel strategies for data-driven understanding of malware behaviors.

  • Research Article
  • Cite Count: 113
  • 10.1016/j.cose.2021.102399
KronoDroid: Time-based Hybrid-featured Dataset for Effective Android Malware Detection and Characterization
  • Jul 9, 2021
  • Computers & Security
  • Alejandro Guerra-Manzanares + 2 more

  • Research Article
  • Cite Count: 17
  • 10.1155/2020/6726147
Combat Mobile Evasive Malware via Skip-Gram-Based Malware Detection
  • Apr 20, 2020
  • Security and Communication Networks
  • Alper Egitmen + 5 more

Android malware detection is an important research topic in the security area. There are a variety of existing malware detection models based on static and dynamic malware analysis. However, most of these models are not very successful when it comes to evasive malware detection. In this study, we aimed to create a malware detection model based on a natural language model called skip-gram to detect evasive malware with the highest accuracy rate possible. In order to train and test our proposed model, we used an up-to-date malware dataset called Argus Android Malware Dataset (AMD) since the AMD contains various evasive malware families and detailed information about them. Meanwhile, for the benign samples, we used Comodo Android Benign Dataset. Our proposed model starts with extracting skip-gram-based features from instruction sequences of Android applications. Then it applies several machine learning algorithms to classify samples as benign or malware. We tested our proposed model with two different scenarios. In the first scenario, the random forest-based classifier performed with 95.64% detection accuracy on the entire dataset and 95% detection accuracy against evasive only samples. In the second scenario, we created a test dataset that contained zero-day malware samples only. For the training set, we did not use any sample that belongs to the malware families in the test set. The random forest-based model performed with 37.36% accuracy rate against zero-day malware. In addition, we compared our proposed model’s malware detection performance against several commercial antimalware applications using VirusTotal API. Our model outperformed 7 out of 10 antimalware applications and tied with one of them on the same test scenario.
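
To make the skip-gram idea concrete, the sketch below shows only the pair-generation step that a skip-gram model trains on: each instruction is paired with its neighbours inside a small window. The function name `skipgram_pairs`, the window size, and the sample Dalvik-style opcode names are illustrative assumptions; the paper's actual feature extraction pipeline may differ.

```python
def skipgram_pairs(opcodes, window=2):
    """Return (center, context) pairs within `window` positions of each opcode."""
    pairs = []
    for i, center in enumerate(opcodes):
        lo = max(0, i - window)
        hi = min(len(opcodes), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an instruction is not its own context
                pairs.append((center, opcodes[j]))
    return pairs

# Hypothetical instruction sequence taken from one method of an app.
seq = ["const", "invoke-virtual", "move-result", "return"]
pairs = skipgram_pairs(seq, window=1)
print(len(pairs))  # 6
```

A skip-gram model would learn embeddings from such pairs, and per-app features could then be built from those embeddings before classification.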

  • Research Article
  • Cite Count: 4
  • 10.1093/comjnl/bxac114
Ensemble Framework Combining Family Information for Android Malware Detection
  • Aug 20, 2022
  • The Computer Journal
  • Yao Li + 5 more

Each malware application belongs to a specific malware family, and each family has unique characteristics. However, existing Android malware detection schemes do not pay attention to the use of malware family information. If the family information is exploited well, it could improve the accuracy of malware detection. In this paper, we propose a general Ensemble framework combining Family Information for Android Malware Detector, called EFIMDetector. First, eight categories of features are extracted from Android application packages. Then, we define the malware family with a large sample size as a prosperous family and construct a classifier for each prosperous family as a conspicuousness evaluator for the family characteristics. These conspicuousness evaluators are combined with a general classifier (which can be a base or ensemble classifier in itself), called the final classifier, to form a two-layer ensemble framework. For the samples of prosperous families with conspicuous family characteristics, the conspicuousness evaluators directly provide detection results. For other samples (including the samples of prosperous families with nonconspicuous family characteristics and the samples of nonprosperous families), the final classifier is responsible for detection. Seven common base classifiers and three common ensemble classifiers are used to detect malware in the experiment. The results show that the proposed ensemble framework can effectively improve the detection accuracy of these classifiers.

  • Conference Article
  • Cite Count: 13
  • 10.1109/uic-atc-scalcom-cbdcom-iop.2015.135
API Sequences Based Malware Detection for Android
  • Aug 1, 2015
  • Jiawei Zhu + 3 more

To mitigate the security problems caused by Android malware, various approaches have been proposed, such as behavior-based and data-mining-based malware detection. In this paper, we put forward a novel Android malware detection model using data mining techniques. We design a two-step algorithm. The first step models Android application code as a graph structure, which we call the API control flow graph. The second step computes the API sequences that satisfy a minimum intra-family support in each malware family, because malware samples in the same family usually share similar behavior patterns. Finally, a supervised learning method is used to build our malware detection model with the API sequences as input features. We evaluate this model on 1200 applications, half malicious and half benign, and find it effective in identifying Android malware, including unknown malware.
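
The minimum intra-family support step can be sketched as follows. This simplified version assumes each sample is already a linear list of API calls, whereas the paper derives its sequences from an API control flow graph; the function name `frequent_api_sequences`, the sequence length, the support threshold, and the sample traces are hypothetical.

```python
from collections import Counter

def frequent_api_sequences(family_samples, length=2, min_support=0.6):
    """Return API n-grams present in at least `min_support` of a family's samples."""
    support = Counter()
    for calls in family_samples:
        ngrams = {tuple(calls[i:i + length]) for i in range(len(calls) - length + 1)}
        support.update(ngrams)  # a set, so each sample counts a sequence at most once
    threshold = min_support * len(family_samples)
    return sorted(seq for seq, n in support.items() if n >= threshold)

# Hypothetical API call traces from three samples of one family.
family = [
    ["getDeviceId", "openConnection", "sendTextMessage"],
    ["getDeviceId", "openConnection", "getLine1Number"],
    ["query", "getDeviceId", "openConnection"],
]
print(frequent_api_sequences(family))  # [('getDeviceId', 'openConnection')]
```

Sequences that survive the threshold in some family can serve as binary input features for a supervised classifier.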

  • Research Article
  • Cite Count: 62
  • 10.1016/j.pmcj.2021.101336
ProDroid — An Android malware detection framework based on profile hidden Markov model
  • Jan 21, 2021
  • Pervasive and Mobile Computing
  • Satheesh Kumar Sasidharan + 1 more

  • Research Article
  • Cite Count: 68
  • 10.1016/j.procs.2021.03.118
Towards Explainable CNNs for Android Malware Detection
  • Jan 1, 2021
  • Procedia Computer Science
  • Martin Kinkead + 3 more

  • Research Article
  • Cite Count: 55
  • 10.1109/tr.2019.2924677
DAMBA: Detecting Android Malware by ORGB Analysis
  • Mar 1, 2020
  • IEEE Transactions on Reliability
  • Weizhe Zhang + 3 more

With the rapid development of smart devices, mobile phones have permeated many aspects of our life. Unfortunately, their widespread popularization has attracted endless attacks that are serious threats for users. As the mobile system with the largest market share, Android has been the hardest hit for years. In this paper, we present DAMBA (Detecting Android Malware by ORGB Analysis), a novel prototype system based on a C/S architecture. DAMBA extracts the static and dynamic features of apps. For further analyses, we propose the TANMAD algorithm, a two-step Android malware detection algorithm, which reduces the range of possible malware families, and then utilizes subgraph isomorphism matching for malware detection. The key novelty of this paper is the modeling of object reference information by constructing directed graphs, which are called object reference graph birthmarks (ORGB). To achieve better efficiency and accuracy, we present several optimization strategies for hybrid analysis. DAMBA is evaluated on a large real-world dataset of 2239 malicious and 1000 popular benign apps. The detection accuracy reaches 100% in most cases, and the average detection time is less than 5 s. Experimental results show that DAMBA outperforms the well-known detector, McAfee, which is based on signature recognition. In addition, DAMBA is demonstrated to resist the known malware attacks and their variants efficiently, as well as malware that uses obfuscation techniques.

  • Single Report
  • 10.2172/1893244
MalGen: Malware Generation with Specific Behaviors to Improve Machine Learning-based Detectors
  • Oct 1, 2022
  • Michael Smith + 12 more

In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML is able to provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, the same detection rates when applied in deployed settings have not been achieved. Malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels and (2) often its behavior is similar to benign software only differing in intent, among other complicating factors. This report details the reasons for the difficulty of detecting novel malware by ML methods and offers solutions to improve the detection of novel malware. We propose to detect malware by detecting behaviors commonly exhibited by malware such as DLL injection, and process hollowing. This is based on the assumption that there is a set of behaviors that are common to most malware samples and detecting them will generalize to novel malware. Additionally, detected behaviors point analysts toward appropriate handling and mitigation strategies, which is not the case with a binary benign/malicious classification. A behavior labeling method was developed and was used to label an existing malware dataset. Results show that detecting malicious behaviors is much more difficult than simply classifying malware and goodware, achieving 80% accuracy compared to reported 99% accuracy from classifying malware and goodware. This drop is due to several reasons which are detailed in the report. We also propose to evaluate the performance of detecting novel malware by holding out a malware family for testing and training on the other families.
Traditional ML evaluation will shuffle the data and then split the data into training and testing. Our method addresses the use-case when novel malware families are encountered and they require more than just a malicious or benign designation. Our results suggest that this type of evaluation is much more difficult than traditional methods and provides more realistic results, albeit, significantly worse. For our behavior detection, accuracy decreases from 80% to 68% across all behaviors when holding out a malware family from training. We show that the degradation in performance is because each malware family has distinct characteristics resulting in high extrapolations by an ML model. Here, an ML model should return an "I do not know" response and request further analysis from an analyst. We run a number of experiments that compare novel malware families to the training data using different feature representations including a genomics-inspired distance measure and features extracted by deep learning. Generally, held-out families are significantly different from the training data, resulting in unpredictable results. This has been observed generally in the ML community [22, 9]. We empirically demonstrate this in the domain of malware detection. In an attempt to improve the detection of malware behaviors, we examine the impact that additional synthetic data has on the performance of an ML model in detecting behaviors in novel malware families. We find that while synthetic data does improve the performance of ML models, often simpler methods perform better than more complicated ones. Two generative modeling techniques were examined to produce synthetic malware samples such that the behaviors present are able to be specified externally. The difficulty is due to finer grained analysis of the executable and modifying the problem from a binary classification problem to a multi-label problem. The addition of synthetic data increases the overall accuracy from 68% to 70%. 
While far less accurate than measures presented in academic analyses, we believe that this is more representative of real-world performance and allows models to be properly placed within a malware detection system. We suggest that in highly dynamic environments ML pipelines should determine whether an ML model is competent in the area of new data and should involve mechanisms to improve over time with a human in the loop.
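
The held-out-family evaluation described in this report differs from a shuffled train/test split in that an entire family is reserved for testing. A minimal sketch of such a split is shown below; the function name `leave_one_family_out`, the tuple layout, and the toy family and behavior names are assumptions for illustration and are not taken from the report.

```python
def leave_one_family_out(samples):
    """Yield (held_out_family, train, test) with one whole family reserved for testing.

    `samples` is a list of (family, features) pairs; the layout is hypothetical.
    """
    families = sorted({fam for fam, _ in samples})
    for held_out in families:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Toy data: three hypothetical families with placeholder behavior labels.
samples = [
    ("FamilyA", ["dll_injection"]),
    ("FamilyA", ["process_hollowing"]),
    ("FamilyB", ["dll_injection"]),
    ("FamilyC", ["process_hollowing"]),
]
for held_out, train, test in leave_one_family_out(samples):
    print(held_out, len(train), len(test))
```

Because no sample of the held-out family is seen during training, the resulting score estimates performance on novel families rather than on new samples of known ones.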

  • Conference Article
  • Cite Count: 2
  • 10.1109/iscmi56532.2022.10068453
Enhancing Classification Performance for Android Small Sample Malicious Families Using Hybrid RGB Image Augmentation Method
  • Nov 26, 2022
  • Yi-Hsuan Ting + 2 more

As computing speed has improved, many studies have applied deep learning to Android malware detection. Beyond detection, malware family classification helps researchers understand the behavior of malware families in order to optimize detection and prevention. However, a new malware family has few samples, which leads to poor classification results. GAN-based methods can improve the classification results, but with so little data the quality of the samples generated by a deep learning augmentation method remains unstable, which limits the improvement. In this study, we propose a hybrid augmentation method: malware features are first extracted and converted into RGB images; the minor families are then augmented with Gaussian noise and combined with a deep convolutional generative adversarial network (DCGAN), which has a better effect on image augmentation; finally, the images are input to a CNN for family classification. The experimental results show that, compared with no augmentation and with augmentation using only the DCGAN, the proposed hybrid method increases the F1-score by 7% to 34% and 2% to 7%, respectively.

  • Research Article
  • Cite Count: 21
  • 10.1109/tc.2022.3143439
Lightweight, Effective Detection and Characterization of Mobile Malware Families
  • Nov 1, 2022
  • IEEE Transactions on Computers
  • Karim O Elish + 2 more

Android malware is an ongoing threat to billions of smart devices’ security, ranging from mobile phones to car infotainment systems. Despite numerous approaches and previous studies to develop solutions for detecting and preventing Android malware, the rapid continuous development of new malware variants requires a careful reconsideration and the development of effective methods to identify malware families given a meager number of malware instances. In this paper, we present DroidMalVet, a novel Android malware family classification and detection approach that does not require complex program analyses or large feature sets. DroidMalVet is the first to use a promising, diverse, and small set of software metrics as features in a supervised learning platform to classify and detect various Android malware families. Our extensive empirical evaluations on two large public malware datasets show that DroidMalVet accurately detects both small and large malware families with F-Score accuracy of 94.4% and 96%, and AUC equal to 99.5% and 99.7% on the malware families in Drebin and AMD datasets, respectively. Moreover, our results demonstrate the superior performance of DroidMalVet in detecting small families (i.e., families with few samples). DroidMalVet complements existing approaches and presents an early warning tool for detecting known and emerging malware families.

  • Research Article
  • Cite Count: 72
  • 10.1016/j.cose.2018.10.001
A scalable and extensible framework for android malware detection and family attribution
  • Oct 9, 2018
  • Computers & Security
  • Li Zhang + 2 more

  • Research Article
  • Cite Count: 28
  • 10.1002/spe.3112
Android malware detection using network traffic based on sequential deep learning models
  • Jun 6, 2022
  • Software: Practice and Experience
  • Somayyeh Fallah + 1 more

The growing capabilities of smartphones have attracted many users. This has led to the emergence of malware that threatens users' privacy and security. Many malware detection methods have been proposed to deal with emerging threats. One of the most effective is network traffic analysis. This article proposes a method based on LSTM (Long Short-Term Memory) for malware detection that is capable not only of distinguishing malware from benign samples, but also of detecting and identifying new and unseen families of malware. As far as we know, this is the first time that traffic data has been modeled as a sequence of flows and a sequential deep learning model has been employed. In this article, we perform several case studies to exhibit the capabilities of the proposed method, including malware detection, malware family identification, new (not seen before) malware family detection, as well as evaluating the minimum time required to detect malware. The case studies show that the model is even capable of detecting new families of malware with more than 90% accuracy, although these results can only be verified on existing families in this dataset and such a claim cannot be generalized to other examples of malware. Moreover, the model is shown to detect malware by capturing 50 connection flows (about 1600 packets on average) with an AUC of more than 99.9%.

  • Conference Article
  • Cite Count: 19
  • 10.1109/ccnc51644.2023.10060381
Towards a Reliable Hierarchical Android Malware Detection Through Image-based CNN
  • Jan 8, 2023
  • Jhonatan Geremias + 4 more

The number of Android malicious applications keeps growing as time passes, even paving their way to official app markets. In recent years, a promising malware detection approach makes use of the compiled app source codes (dex), through convolutional neural networks (CNN) as an image classification task. Unfortunately, current proposals often rely on unrealistic datasets, focusing their detection on the malware families, while neglecting the detection of malware apps in the first place. In this paper, we propose a reliable and hierarchical Android malware detection through an image-based CNN scheme, implemented twofold. First, Android malware classification is performed in a hierarchically-structured local manner, initially identifying malware apps, then, their related family. Second, to ensure reliability and improve classification accuracy, only highly confident classified apps are reported, in a classification with reject option rationale. Experiments performed in a new dataset with over 26 thousand Android apps, divided into 29 malware families, compounding over 13 GB of app dex images, have shown that current image-based CNN for malware detection is unable to provide high detection accuracies. In contrast, our proposed model is able to reliably detect malware apps, improving the true-negative rates by up to 5.5%, and the average true-positive rate of the malware families of accepted apps by up to 12.7%, while rejecting only 10% of Android apps.
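
The two ideas in this abstract, hierarchical classification and a reject option, can be illustrated with a small decision function. This is a sketch under stated assumptions: the probabilities are taken to come from two hypothetical models (a malware/benign detector and a family classifier), and the function name `classify_with_reject`, the single shared threshold of 0.9, and the family names are not from the paper.

```python
def classify_with_reject(malware_prob, family_probs, threshold=0.9):
    """Two-stage decision: malware vs. benign first, then family; reject when unsure.

    malware_prob: probability from a hypothetical first-stage detector.
    family_probs: non-empty dict of family -> probability from a hypothetical
                  second-stage model (only consulted for apps flagged as malware).
    """
    confidence = max(malware_prob, 1.0 - malware_prob)
    if confidence < threshold:
        return "rejected"  # first stage is not confident either way
    if malware_prob < 0.5:
        return "benign"
    family, prob = max(family_probs.items(), key=lambda kv: kv[1])
    if prob < threshold:
        return "malware (family rejected)"
    return f"malware: {family}"

print(classify_with_reject(0.97, {"FamilyA": 0.95, "FamilyB": 0.05}))  # malware: FamilyA
print(classify_with_reject(0.60, {"FamilyA": 0.95, "FamilyB": 0.05}))  # rejected
```

Raising the threshold trades coverage for reliability: more apps are rejected, but the decisions that are reported come with higher confidence.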
