Analysis of Android Malware Family Characteristic Based on Isomorphism of Sensitive API Call Graph
The analysis of multiple Android malware families indicates that malware instances within a common malware family always have similar call graph structures. Based on the isomorphism of sensitive API call graphs, we propose a method that constructs malware family features by combining a static analysis approach with a graph similarity metric. The experiment is performed on a malware dataset that contains 1326 malware samples from 16 different malware families. The results show that the method can differentiate distinct malware family features and assign suspect malware samples to their corresponding families with a high overall accuracy of 96.77%, and can even withstand a certain degree of obfuscation.
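As a rough illustration of the idea in the abstract above, the sketch below compares call graphs restricted to sensitive APIs and assigns a sample to the most similar family. It is not the authors' method: the abstract does not specify the similarity metric, so a plain Jaccard similarity over edge sets stands in for the isomorphism-based comparison, and all function names, API names, and the threshold are hypothetical.

```python
def sensitive_subgraph(edges, sensitive_apis):
    """Keep only call edges whose callee is in the (assumed) sensitive API set."""
    return {(caller, callee) for caller, callee in edges if callee in sensitive_apis}


def jaccard_similarity(edges_a, edges_b):
    """Stand-in graph similarity: shared edges divided by all edges."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


def assign_family(sample_edges, family_features, threshold=0.5):
    """Return (family, score) for the best match, or (None, score) below threshold."""
    best_family, best_score = None, 0.0
    for family, feature_edges in family_features.items():
        score = jaccard_similarity(sample_edges, feature_edges)
        if score > best_score:
            best_family, best_score = family, score
    if best_score < threshold:
        return None, best_score
    return best_family, best_score
```

Matching edges by name would break under identifier renaming, so a real implementation would need a structural comparison such as the graph isomorphism the paper refers to.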
- Research Article
120
- 10.1109/tdsc.2017.2739145
- Oct 23, 2019
- IEEE Transactions on Dependable and Secure Computing
As the most widely used mobile platform, Android is also the biggest target for mobile malware. Given the increasing number of Android malware variants, detecting malware families is crucial so that security analysts can identify situations where signatures of a known malware family can be adapted as opposed to manually inspecting behavior of all samples. We present EC2 (Ensemble Clustering and Classification), a novel algorithm for discovering Android malware families of varying sizes, ranging from very large to very small families (even if previously unseen). We present a performance comparison of several traditional classification and clustering algorithms for Android malware family identification on DREBIN, the largest public Android malware dataset with labeled families. We use the output of both supervised classifiers and unsupervised clustering to design EC2. Experimental results on both the DREBIN and the more recent Koodous malware datasets show that EC2 accurately detects both small and large families, outperforming several comparative baselines. Furthermore, we show how to automatically characterize and explain unique behaviors of specific malware families, such as FakeInstaller, MobileTx, and Geinimi. In short, EC2 presents an early warning system for emerging new malware families, as well as a robust predictor of the family (when it is not new) to which a new malware sample belongs, and the design of novel strategies for data-driven understanding of malware behaviors.
- Research Article
21
- 10.1109/tc.2022.3143439
- Nov 1, 2022
- IEEE Transactions on Computers
Android malware is an ongoing threat to billions of smart devices’ security, ranging from mobile phones to car infotainment systems. Despite numerous approaches and previous studies to develop solutions for detecting and preventing Android malware, the rapid continuous development of new malware variants requires a careful reconsideration and the development of effective methods to identify malware families given a meager number of malware instances. In this paper, we present DroidMalVet, a novel Android malware family classification and detection approach that does not require performing complex program analyses or using large feature sets. DroidMalVet is the first to use a promising, diverse, and small set of software metrics as features in a supervised learning platform to classify and detect various Android malware families. Our extensive empirical evaluations on two large public malware datasets show that DroidMalVet accurately detects both small and large malware families with F-Score accuracy of 94.4% and 96%, and AUC equal to 99.5% and 99.7% on the malware families in Drebin and AMD datasets, respectively. Moreover, our results demonstrate the superior performance of DroidMalVet in detecting small families (i.e., families with few samples). DroidMalVet complements existing approaches and presents an early warning tool for detecting known and emerging malware families.
- Research Article
246
- 10.1016/j.eswa.2013.07.106
- Aug 13, 2013
- Expert Systems with Applications
Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families
- Research Article
18
- 10.1109/tnnls.2021.3099122
- Feb 1, 2023
- IEEE Transactions on Neural Networks and Learning Systems
We study the challenging task of malware recognition on both known and novel unknown malware families, called malware open-set recognition (MOSR). Previous works usually assume the malware families are known to the classifier in a close-set scenario, i.e., testing families are the subset or at most identical to training families. However, novel unknown malware families frequently emerge in real-world applications, and as such, require recognizing malware instances in an open-set scenario, i.e., some unknown families are also included in the test set, which has been rarely and nonthoroughly investigated in the cyber-security domain. One practical solution for MOSR may consider jointly classifying known and detecting unknown malware families by a single classifier (e.g., neural network) from the variance of the predicted probability distribution on known families. However, conventional well-trained classifiers usually tend to obtain overly high recognition probabilities in the outputs, especially when the instance feature distributions are similar to each other, e.g., unknown versus known malware families, and thus, dramatically degrade the recognition on novel unknown malware families. To address the problem and construct an applicable MOSR system, we propose a novel model that can conservatively synthesize malware instances to mimic unknown malware families and support a more robust training of the classifier. More specifically, we build upon the generative adversarial networks to explore and obtain marginal malware instances that are close to known families while falling into mimical unknown ones to guide the classifier to lower and flatten the recognition probabilities of unknown families and relatively raise that of known ones to rectify the performance of classification and detection. 
A cooperative training scheme involving classification, synthesis, and rectification is further constructed to facilitate the training and jointly improve the model performance. Moreover, we also build a new large-scale malware dataset, named MAL-100, to fill the gap of lacking a large open-set malware benchmark dataset. Experimental results on two widely used malware datasets and our MAL-100 demonstrate the effectiveness of our model compared with other representative methods.
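The single-classifier baseline mentioned in this abstract (classify known families, flag unknown ones from the predicted probability distribution) can be sketched as a confidence threshold on softmax outputs. This is only a minimal illustration of that baseline, not the GAN-based model the paper proposes; the function names and the threshold value are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw classifier scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def open_set_predict(logits, families, threshold=0.7):
    """Return (label, confidence); label is "unknown" when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return families[best], probs[best]
```

The abstract's point is that this baseline tends to be overconfident on unseen families, which is why a fixed threshold alone is often insufficient.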
- Single Report
- 10.2172/1893244
- Oct 1, 2022
In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML is able to provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, the same detection rates when applied in deployed settings have not been achieved. Malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels and (2) often its behavior is similar to benign software only differing in intent, among other complicating factors. This report details the reasons for the difficulty of detecting novel malware by ML methods and offers solutions to improve the detection of novel malware. We propose to detect malware by detecting behaviors commonly exhibited by malware such as DLL injection and process hollowing. This is based on the assumption that there is a set of behaviors that are common to most malware samples and detecting them will generalize to novel malware. Additionally, detected behaviors point analysts toward appropriate handling and mitigation strategies, which is not the case with a binary benign/malicious classification. A behavior labeling method was developed and was used to label an existing malware dataset. Results show that detecting malicious behaviors is much more difficult than simply classifying malware and goodware, achieving 80% accuracy compared to reported 99% accuracy from classifying malware and goodware. This drop is due to several reasons which are detailed in the report. We also propose to evaluate the performance of detecting novel malware by holding out a malware family for testing and training on the other families.
Traditional ML evaluation will shuffle the data and then split the data into training and testing. Our method addresses the use-case when novel malware families are encountered and they require more than just a malicious or benign designation. Our results suggest that this type of evaluation is much more difficult than traditional methods and provides more realistic results, albeit, significantly worse. For our behavior detection, accuracy decreases from 80% to 68% across all behaviors when holding out a malware family from training. We show that the degradation in performance is because each malware family has distinct characteristics resulting in high extrapolations by an ML model. Here, an ML model should return an "I do not know" response and request further analysis from an analyst. We run a number of experiments that compare novel malware families to the training data using different feature representations including a genomics-inspired distance measure and features extracted by deep learning. Generally, held-out families are significantly different from the training data, resulting in unpredictable results. This has been observed generally in the ML community [22, 9]. We empirically demonstrate this in the domain of malware detection. In an attempt to improve the detection of malware behaviors, we examine the impact that additional synthetic data has on the performance of an ML model in detecting behaviors in novel malware families. We find that while synthetic data does improve the performance of ML models, often simpler methods perform better than more complicated ones. Two generative modeling techniques were examined to produce synthetic malware samples such that the behaviors present are able to be specified externally. The difficulty is due to finer grained analysis of the executable and modifying the problem from a binary classification problem to a multi-label problem. The addition of synthetic data increases the overall accuracy from 68% to 70%. 
While far less accurate than measures presented in academic analyses, we believe that this is more representative of real-world performance and allows models to be properly placed within a malware detection system. We suggest that in highly dynamic environments ML pipelines should determine whether an ML model is competent in the area of new data and should involve mechanisms to improve over time with a human in the loop.
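The held-out-family evaluation described in this report can be sketched as a split generator that never lets the test family appear in training, in contrast to a shuffled split. This is a minimal sketch under the assumption that each sample is a `(features, family)` pair; the function name is hypothetical and the report's actual tooling is not described in the abstract.

```python
def leave_one_family_out(samples):
    """Yield (held_out_family, train, test) with the test family absent from train.

    `samples` is assumed to be a list of (features, family_label) pairs.
    """
    families = sorted({family for _, family in samples})
    for held_out in families:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test
```

Each split then measures how a model extrapolates to a family it has never seen, which is the setting where the report observes accuracy dropping from 80% to 68%.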
- Research Article
59
- 10.3390/electronics9060942
- Jun 5, 2020
- Electronics
Android receives major attention from security practitioners and researchers due to the influx of malicious applications. For the past twelve years, Android malicious applications have been grouped into families. In the research community, detecting new malware families is a challenge. Our investigation shows that most literature reviews focus on surveying malware detection. Characterizing the malware families can improve the detection process and help in understanding malware patterns. For this reason, we conduct a comprehensive survey on the state-of-the-art Android malware familial detection, identification, and categorization techniques. We categorize the literature based on three dimensions: type of analysis, features, and methodologies and techniques. Furthermore, we report the datasets that are commonly used. Finally, we highlight the limitations that we identify in the literature, challenges, and future research directions regarding the Android malware family.
- Conference Article
23
- 10.1109/bigdata47090.2019.9005669
- Dec 1, 2019
Android malware (malicious apps) families share common attributes and behavior through sharing core malicious code. However, as the number of new malware increases, the task of identifying the correct family becomes more challenging. Two prominent approaches tackle this problem, either using dynamic analysis that captures the runtime behavior of the malware or using static analysis methods that can reveal malicious behavior by analyzing the underlying logic and code patterns. A third emerging way is to use the various sources of identification features to analyze the architectural and external attributes of a malicious app. For example, two malicious apps can have different behavioral patterns but share common attributes. We hypothesize that this malware can belong to the same family but attempt to mislead dynamic and code-level static analysis tools by randomizing their behavior. In this work, we utilize a promising set of Android-oriented code metrics that guide a supervised classification learning process for identifying malware families in Android. Our empirical results on 2,869 malware apps, across 35 different malware families, show that these metrics are very effective at identifying malware families. In particular, we achieve a low false positive rate (1.2%) and an AUC score of 0.984 for family identification by using a Random Forest (RF) classifier.
- Research Article
12
- 10.1109/access.2024.3400211
- Jan 1, 2024
- IEEE Access
Billions of people globally use Android devices. As such, these devices are highly targeted by security attackers. One of the most threatening attacks is to infect devices with malicious software (malware). Fortunately, there are various ways to counteract these attacks and prevent them. One of these methods is developing a comprehensive malware dataset that researchers can utilize for malware analysis, detection, prediction, and prevention systems. This paper introduces a unique, up-to-date, labeled Android malware dataset (Maloid-DS) comprising a comprehensive set of malware families that reached 345 families with 47,971 malware samples. First, we intensely studied existing datasets utilized by previous research works. These datasets are limited in (a) the number of studied families, (b) the number of samples under each family, (c) the number of new malware samples, (d) the proper categorization of the malware families, (e) the accurate mapping of the sample with its corresponding malware family, (f) providing well structuring of the malware families and subfamilies, and (g) presenting a profound description of each family behavior. All these limitations were seriously tackled by introducing Maloid-DS. The process of creating Maloid is detailed in this paper. Moreover, several case studies are demonstrated in this paper to show the value of Maloid and how different types of analysis systems and AI-based detection and prediction solutions could utilize it. While the full potential of Maloid-DS in real-world scenarios is subject to ongoing research and practical application, it represents a substantial contribution to the cybersecurity community, offering a broad and detailed foundation for protecting Android devices against malware threats.
- Research Article
7
- 10.1016/j.cose.2024.103714
- Jan 17, 2024
- Computers & Security
MAlign: Explainable static raw-byte based malware family classification using sequence alignment
- Conference Article
6
- 10.1109/ic3sis54991.2022.9885587
- Jun 23, 2022
Methodologies used for the detection of malicious applications can be broadly classified into static and dynamic analysis based approaches. With traditional signature-based methods, new variants of malware families cannot be detected. A combination of deep learning techniques along with image-based features is used in this work to classify malware. The data set used here is the ‘Malimg’ dataset, which contains a pictorial representation of well-known malware families. This paper proposes a methodology for identifying malware images and classifying them into various families. The classification is based on image features. The features are extracted using the pre-trained model namely VGG16. The samples of malware are depicted as byteplot grayscale images. Features are extracted employing the convolutional layer of a VGG16 deep learning network, which uses ImageNet dataset for the pre-training step. The features are used to train different classifiers which employ SVM, XGBoost, DNN and Random Forest for the classification task into different malware families. Using 9339 samples from 25 different malware families, we performed experimental evaluations and demonstrate that our approach is effective in identifying malware families with high accuracy.
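The byteplot representation mentioned in this abstract maps each byte of a binary to one grayscale pixel (0 to 255) and wraps the byte stream into fixed-width rows. A minimal sketch follows; the function name, the default width, and the zero padding of the last row are assumptions, and the VGG16 feature extraction step is not shown.

```python
def bytes_to_byteplot(data, width=16):
    """Turn raw bytes into rows of grayscale pixel values (0-255).

    The final row is zero-padded so every row has exactly `width` pixels.
    """
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))
        rows.append(row)
    return rows
```

For example, `bytes_to_byteplot(open(path, "rb").read(), width=256)` would yield a 2D pixel grid for the file at a hypothetical `path`, which could then be resized to the input size a pre-trained network expects.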
- Conference Article
35
- 10.1145/3129676.3129712
- Sep 20, 2017
With thousands of malware samples pouring out every day, how can we reduce malware analysis time and detect them effectively? Malware family classification provides a good means of predicting characteristics of unknown malware, since malware belonging to the same family can have similar features. Static analysis and dynamic analysis are techniques to obtain features to be used for classifying malware samples into their families. Static analysis performs analysis based on specific signatures included in the malware. Static analysis has the advantages that the scope of the analysis covers the entire code, and the analysis can be performed without executing the malware. However, it is very difficult to detect or classify malware variants with only the results of the static analysis, because malware developers use polymorphic or encryption techniques to avoid static analysis-based detection of anti-virus software. Dynamic analysis analyzes malware behaviors, so the results of dynamic analysis can be used to detect or classify malware variants. One of the dynamic features that can be used to detect or classify malware variants is API call sequences. In this paper, we propose a novel method to extract representative API call patterns of malware families using a Recurrent Neural Network (RNN). We conducted experiments with 787 malware samples belonging to 9 families. In our experiments, we extracted representative API call patterns of 9 malware families on 551 samples as a training set and performed classification on the 236 samples as a test set. Classification accuracy results using API call patterns extracted from the RNN were measured as 71% on average. The results show the feasibility of our approach using an RNN to extract representative API call patterns of malware families for malware family classification.
- Research Article
44
- 10.1007/s11416-022-00416-3
- Feb 9, 2022
- Journal of Computer Virology and Hacking Techniques
Today, the extensive reliance on technology has exposed us to a constant threat of sophisticated malware attacks. Various automated malware production techniques have evolved, some of which reuse specific segments of existing malware to produce new malware, making malware detection challenging. In this paper, we propose a Convolutional Recurrence based malware classification technique that exploits the visual recurrences in the grayscale images of the malware samples belonging to the same malware families. Firstly, we convert the malware samples into grayscale images to capture the structural similarities from the malware samples using a Convolutional Neural Network architecture. Then we perform data augmentation to counter the effects of high data imbalance and reduce the class bias, such that training on that dataset would generate a more generalized framework. The augmented dataset is then passed through a VGG16 based feature extractor to extract the visual outliers amongst the malware families. Now, the extracted features are processed by passing them through two stacked BiLSTM layers. The outputs generated by the BiLSTM layers and the VGG16 layer are then merged to perform the final classification of the malware sample into its malware family. The model’s performance is further improved by using proper hyperparameter tuning. We compare the performance of our algorithm against several baseline methods and some contemporary state-of-the-art methods for visual malware detection by utilizing two benchmarked datasets. The obtained experimental results reveal the utility and efficacy of our proposed malware family classification technique.
- Research Article
97
- 10.1016/j.eswa.2019.113022
- Oct 16, 2019
- Expert Systems with Applications
Detecting malware evolution using support vector machines
- Book Chapter
6
- 10.1007/978-3-030-63095-9_26
- Jan 1, 2020
- Lecture notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Malware has now grown to be one of the most important threats in internet security. As the number of malware families has increased rapidly, a malware classification model needs to classify the samples from emerging malware families. In a real-world environment, the number of malware samples varies greatly with each family and some malware families only have a few samples. Therefore, it is a challenging task to obtain a malware classification model with strong generalization ability by using only a few labeled malware samples in each family. In this paper, we propose an attention-based transductive learning approach to tackle this problem. To extract features from raw malware binaries, our approach first converts them into gray-scale images. After visualization, an embedding function is used to encode the images into feature maps. Then we build an attention-based Gaussian similarity graph to help propagate the label information from well-labeled instances to unknown instances. With end-to-end training, we validate our attention-based transductive learning network on a malware database of 11,236 samples with 30 different malware families. Compared with state-of-the-art approaches, the experimental results show that our approach achieves a better performance.
- Book Chapter
- 10.1007/978-981-16-0171-2_3
- Jan 1, 2021
Malware is dangerous for system and network users. Malware identification is an essential task in effectively detecting and preventing the computer system from self-infection, protecting it from potential data loss and system compromise. Commonly, 25 malware families exist. With the development of malicious code engineering, traditional malware detection and anti-virus systems fail to classify the new variants of unknown malware into their corresponding families, and it is possible to understand the malware variants and their features for new malware samples that carry variability and polymorphism. The detection methods can rarely detect such variants, but it is important in the cybersecurity field to investigate and detect large-scale malware samples more efficiently. In this paper, an accurate malware family classification model using a convolutional neural network technique is proposed. Malware family recognition is formulated as a multi-classification task, and an accurate solution is obtained by training a convolutional neural network with images of malware executable files. Ten families of malware have been considered here for building the models. The image dataset with 2000 instances is applied to a convolutional neural network to build the classifier. The experimental results, based on a dataset of ten classes of malware families and a model trained on 2000 malware images, provide an accuracy of over 95% in discriminating among malware families. The techniques provide better results for classifying malware into families.