A3CM: Automatic Capability Annotation for Android Malware
Android malware poses serious security and privacy threats to mobile users. Traditional malware detection and family classification technologies are becoming less effective due to the rapid evolution of the malware landscape and the emergence of so-called zero-day-family malware. To address this issue, our paper presents a novel research problem: automatically identifying the security/privacy-related capabilities of any detected malware, which we refer to as Malware Capability Annotation (MCA). Motivated by the observation that known and zero-day-family malware share security/privacy-related capabilities, MCA opens a new way to effectively analyze zero-day-family malware (malware that does not belong to any existing family) by exploring the related information and knowledge from known malware families. To address the MCA problem, we design a new MCA solution, Automatic Capability Annotation for Android Malware (A3CM). A3CM works in the following four steps: 1) A3CM automatically extracts a set of semantic features, such as permissions, API calls, and network addresses, from raw binary APKs to characterize malware samples; 2) A3CM applies a statistical embedding method to map the features into a joint feature space, so that malware samples can be represented as numerical vectors; 3) A3CM infers the malicious capabilities by using a multi-label classification model; 4) the trained multi-label model is used to annotate the malicious capabilities of the candidate malware samples. To facilitate new research on MCA, we create a new ground truth dataset that consists of 6,899 annotated Android malware samples from 72 families. We carry out a large number of experiments based on four representative security/privacy-related capabilities to evaluate the effectiveness of A3CM. Our results show that A3CM can achieve promising accuracy of 1.00, 0.98 and 0.63 in inferring multiple capabilities of known Android malware, small-size-family malware and zero-day-family Android malware, respectively.
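The four steps above can be illustrated with a small sketch. It assumes features have already been extracted as sets of strings. The binary bag-of-features embedding and the nearest-neighbour label transfer are stand-ins chosen for brevity, not the statistical embedding or the multi-label model used by A3CM. All function names, feature strings, and capability labels are hypothetical.

```python
from typing import List, Set, Tuple

def build_vocabulary(samples: List[Set[str]]) -> List[str]:
    """Step 1 output -> a fixed, sorted vocabulary of every feature seen in training."""
    return sorted(set().union(*samples))

def embed(features: Set[str], vocabulary: List[str]) -> List[int]:
    """Step 2 (stand-in): binary vector over the joint feature space."""
    return [1 if f in features else 0 for f in vocabulary]

def jaccard(a: List[int], b: List[int]) -> float:
    """Similarity between two binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def annotate(candidate: Set[str], train: List[Tuple[Set[str], Set[str]]]) -> Set[str]:
    """Steps 3-4 (stand-in): copy the capability labels of the most similar
    training sample. A real system would train a multi-label classifier."""
    vocabulary = build_vocabulary([features for features, _ in train])
    target = embed(candidate, vocabulary)
    best = max(train, key=lambda item: jaccard(target, embed(item[0], vocabulary)))
    return set(best[1])

# Hypothetical training data: (extracted features, annotated capabilities).
train = [
    ({"SEND_SMS", "SmsManager.sendTextMessage"}, {"sms_fraud"}),
    ({"READ_CONTACTS", "INTERNET", "http://example.invalid/upload"}, {"privacy_stealing"}),
    ({"SEND_SMS", "READ_CONTACTS", "INTERNET"}, {"sms_fraud", "privacy_stealing"}),
]
candidate = {"SEND_SMS", "SmsManager.sendTextMessage", "INTERNET"}
labels = annotate(candidate, train)
```

Because each sample carries a set of labels rather than a single family name, a sample from an unseen family can still receive capability annotations from its nearest known neighbours.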
- Research Article
39
- 10.1109/tdsc.2020.2982635
- Jan 1, 2022
- IEEE Transactions on Dependable and Secure Computing
Despite the growing threat posed by Android malware, the research community still lacks a comprehensive view of common behaviors and emerging trends in malware families active on the platform. Without such a view, researchers risk developing systems that only detect outdated threats, missing the most recent ones. In this article, we conduct the largest measurement of Android malware behavior to date, analyzing over 1.2 million malware samples that belong to 1.28K families over a period of eight years (from 2010 to 2017). We aim at understanding how Android malware has evolved over time, focusing on <i>repackaging</i> malware. In this type of threat, different innocuous apps are piggybacked with a malicious payload (<i>rider</i>), allowing inexpensive malware manufacturing. One of the main challenges posed when studying repackaged malware is slicing the app to split benign components apart from the malicious ones. To address this problem, we use differential analysis to isolate software components that are irrelevant to the campaign and study the behavior of malicious riders alone. Our analysis framework relies on collective repositories and recent advances in the systematization of intelligence extracted from multiple anti-virus vendors. We find that since its infancy in 2010, the Android malware ecosystem has changed significantly, both in the type of malicious activity performed by malware and in the level of obfuscation used to avoid detection. Finally, we discuss what our findings mean for Android malware detection research, highlighting areas that need further attention by the research community. In particular, we show that riders of malware families evolve over time. This is evidence of an important experimental bias in research works that rely on automated systems for family identification without considering variants.
- Conference Article
91
- 10.1109/rdaaps48126.2021.9452002
- May 18, 2021
The unmatched threat of Android malware has tremendously increased the need for analyzing prominent malware samples. There are remarkable efforts in static and dynamic malware analysis using static features and API calls respectively. Nonetheless, there is a gap in classifying Android malware by analyzing its behavior using multiple dynamic characteristics. This paper proposes <i>EntropLyzer</i>, an entropy-based behavioral analysis technique for classifying the behavior of 12 eminent Android malware categories and 147 malware families taken from the CCCS-CIC-AndMal2020 dataset. This work uses six classes of dynamic characteristics including memory, API, network, logcat, battery, and process to classify and characterize Android malware. Results reveal that the entropy-based analysis successfully determines the behavior of all malware categories and most of the malware families before and after rebooting the emulator.
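The abstract above does not give the exact entropy formulation, so the following is only a minimal sketch of one plausible building block: Shannon entropy over the observed values of a single dynamic feature. The function name and the example observations are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable

def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical discretised observations of one dynamic feature,
# for example the API category invoked at each time step.
steady = ["network", "network", "network", "network"]
mixed = ["network", "memory", "network", "memory"]
steady_entropy = shannon_entropy(steady)
mixed_entropy = shannon_entropy(mixed)
```

A feature whose values never change has zero entropy, while one that alternates evenly between two values has one bit, so entropy summarises how varied a behaviour trace is.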
- Conference Article
5
- 10.1109/cyberc.2017.36
- Oct 1, 2017
With the popularity of mobile platforms, users' sensitive information and financial security are closely tied to smartphones. An increasing number of malware samples appear on Android since it is the most popular operating system. However, there are no systematic methods to analyze and recognize those malware samples. Malware is labeled and classified according to different standards. In this paper, the Android malware gene is defined and extracted to recognize Android malware systematically. A malware gene is the minimum subsequence of statements that results in the functional information, and it commonly occurs in a malware family. Moreover, clustering via K-means is used to validate the identification effect of the Android malware gene. Experimental results show that the malware gene is effective for recognizing and analyzing Android malware.
- Conference Article
5
- 10.1109/sin54109.2021.9699322
- Dec 15, 2021
With the increased popularity and wide adoption of Android as a mobile OS platform, it has been a major target for malware authors. Due to unprecedented rapid growth in the number, variants, and diversity of malware, detecting malware on the Android platform has become challenging. Beyond detecting malware, classifying the family a malware sample belongs to helps security analysts reuse malware removal techniques that are known to work for that family. If a malware sample belongs to an unknown family, manual analysis is required. Therefore, classifying malware into its exact family is important. This paper presents a technique and tool named MAPFam that applies machine learning on static features from the Manifest file and API packages to classify an Android malware sample into its family. This work is premised on a starting hypothesis that features extracted from API packages rather than from API calls lead to more precise classification. Our experiments indeed show that API package based models provide ∼1.63X more accurate classification compared to an API call based method. Our machine learning based malware family classification system uses API packages, requested permissions, and other features from the Manifest files. The proposed family classification system achieves accuracy and average precision above 97% for the top 60 malware families by using only 81 features, with a 97.55% model reliability rate (Kappa score). The experimental results also show that MAPFam can perfectly identify 36 malware families.
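The Kappa score mentioned above is commonly computed as Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between predicted and true families and p_e is the agreement expected by chance. The sketch below assumes that definition. The family labels are hypothetical, and this is not the evaluation code of MAPFam.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Cohen's kappa between true and predicted family labels."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: fraction of samples classified into the right family.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # only one label on both sides, so agreement is trivially perfect
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical family labels for four samples.
truth = ["FamilyA", "FamilyA", "FamilyB", "FamilyB"]
predicted = ["FamilyA", "FamilyA", "FamilyB", "FamilyA"]
kappa = cohen_kappa(truth, predicted)
```

Unlike plain accuracy, kappa discounts the agreement a classifier would obtain by guessing according to the label frequencies, which matters when family sizes are very unbalanced.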
- Research Article
56
- 10.1109/tcyb.2022.3164625
- Jan 1, 2023
- IEEE Transactions on Cybernetics
Evolving Android malware poses a severe security threat to mobile users, and machine-learning (ML)-based defense techniques attract active research. Due to the lack of knowledge, much zero-day-family malware may remain undetected until the classifier gains specialized knowledge. Most existing ML-based methods take a long time to learn new malware families in the latest malware family landscape. Existing ML-based Android malware detection and classification methods struggle with the fast evolution of the malware landscape, particularly in terms of the emergence of zero-day malware families and the limited representation of single-view features. In this article, a new multiview feature intelligence (MFI) framework is developed to learn the representation of a targeted capability from known malware families for recognizing unknown and evolving malware with the same capability. The new framework performs reverse engineering to extract multiview heterogeneous features, including semantic string features, API call graph features, and smali opcode sequential features. It can learn the representation of a targeted capability from known malware families through a series of processes of feature analysis, selection, aggregation, and encoding, to detect unknown Android malware with the shared target capability. We create a new dataset with ground-truth information regarding capability. Many experiments are conducted on the new dataset to evaluate the performance and effectiveness of the new method. The results demonstrate that the new method outperforms three state-of-the-art methods, including: 1) Drebin; 2) MaMaDroid; and 3) N-opcode, when detecting unknown Android malware with targeted capabilities.
- Research Article
12
- 10.1109/access.2024.3400211
- Jan 1, 2024
- IEEE Access
Billions of people globally use Android devices. As such, these devices are highly targeted by security attackers. One of the most threatening attacks is to infect devices with malicious software (malware). Fortunately, there are various ways to counteract these attacks and prevent them. One of these methods is developing a comprehensive malware dataset that researchers can utilize for malware analysis, detection, prediction, and prevention systems. This paper introduces a unique, up-to-date, labeled Android malware dataset (Maloid-DS) comprising a comprehensive set of 345 malware families with 47,971 malware samples. First, we intensively studied existing datasets utilized by previous research works. These datasets are limited in (a) the number of studied families, (b) the number of samples under each family, (c) the number of new malware samples, (d) the proper categorization of the malware families, (e) the accurate mapping of each sample to its corresponding malware family, (f) the structuring of the malware families and subfamilies, and (g) the depth of the description of each family's behavior. Maloid-DS was designed to address all of these limitations. The process of creating Maloid is detailed in this paper. Moreover, several case studies are demonstrated in this paper to show the value of Maloid and how different types of analysis systems and AI-based detection and prediction solutions could utilize it. While the full potential of Maloid-DS in real-world scenarios is subject to ongoing research and practical application, it represents a substantial contribution to the cybersecurity community, offering a broad and detailed foundation for protecting Android devices against malware threats.
- Research Article
18
- 10.1109/tdsc.2022.3219082
- Jan 1, 2024
- IEEE Transactions on Dependable and Secure Computing
Due to its open-source nature, the Android operating system has been the main target of attackers to exploit. Malware creators always perform different code obfuscations on their apps to hide malicious activities. Features extracted from these obfuscated samples through program analysis contain many useless and disguised features, which leads to many false negatives. To address the issue, in this paper, we demonstrate that obfuscation-resilient malware family analysis can be achieved through contrastive learning. The key insight behind our analysis is that contrastive learning can be used to reduce the difference introduced by obfuscation while amplifying the difference between malware and other types of malware. Based on the proposed analysis, we design a system that can achieve robust and interpretable classification of Android malware. To achieve robust classification, we perform contrastive learning on malware samples to learn an encoder that can automatically extract robust features from malware samples. To achieve interpretable classification, we transform the function call graph of a sample into an image by centrality analysis. Then the corresponding heatmaps can be obtained by visualization techniques. These heatmaps can help users understand why the malware is classified as this family. We implement <i>IFDroid</i> and perform extensive evaluations on two datasets. Experimental results show that <i>IFDroid</i> is superior to state-of-the-art Android malware familial classification systems. Moreover, <i>IFDroid</i> is capable of maintaining a 98.4% F1 on classifying 69,421 obfuscated malware samples.
- Research Article
120
- 10.1109/tdsc.2017.2739145
- Oct 23, 2019
- IEEE Transactions on Dependable and Secure Computing
As the most widely used mobile platform, Android is also the biggest target for mobile malware. Given the increasing number of Android malware variants, detecting malware families is crucial so that security analysts can identify situations where signatures of a known malware family can be adapted as opposed to manually inspecting behavior of all samples. We present EC2 (Ensemble Clustering and Classification), a novel algorithm for discovering Android malware families of varying sizes, ranging from very large to very small families (even if previously unseen). We present a performance comparison of several traditional classification and clustering algorithms for Android malware family identification on DREBIN, the largest public Android malware dataset with labeled families. We use the output of both supervised classifiers and unsupervised clustering to design EC2. Experimental results on both the DREBIN and the more recent Koodous malware datasets show that EC2 accurately detects both small and large families, outperforming several comparative baselines. Furthermore, we show how to automatically characterize and explain unique behaviors of specific malware families, such as FakeInstaller, MobileTx, and Geinimi. In short, EC2 presents an early warning system for emerging new malware families, as well as a robust predictor of the family (when it is not new) to which a new malware sample belongs, and the design of novel strategies for data-driven understanding of malware behaviors.
- Conference Article
35
- 10.1145/3129676.3129712
- Sep 20, 2017
With thousands of malware samples pouring out every day, how can we reduce malware analysis time and detect them effectively? Malware family classification provides a good way to predict characteristics of unknown malware, since malware belonging to the same family can have similar features. Static analysis and dynamic analysis are techniques to obtain features to be used for classifying malware samples into their families. Static analysis performs analysis based on specific signatures included in the malware. Static analysis has the advantages that the scope of the analysis covers the entire code, and the analysis can be performed without executing the malware. However, it is very difficult to detect or classify malware variants with only the results of the static analysis, because malware developers use polymorphic or encryption techniques to avoid static analysis-based detection by anti-virus software. Dynamic analysis analyzes malware behaviors, so the results of dynamic analysis can be used to detect or classify malware variants. One of the dynamic features that can be used to detect or classify malware variants is API call sequences. In this paper, we propose a novel method to extract representative API call patterns of malware families using a Recurrent Neural Network (RNN). We conducted experiments with 787 malware samples belonging to 9 families. In our experiments, we extracted representative API call patterns of 9 malware families from 551 samples as a training set and performed classification on 236 samples as a test set. Classification accuracy using the API call patterns extracted by the RNN was 71% on average. The results show the feasibility of our approach using an RNN to extract representative API call patterns of malware families for malware family classification.
- Book Chapter
- 10.1007/978-3-030-74664-3_4
- Jan 1, 2021
Security practitioners can combat large-scale Android malware by decreasing the analysis window size of newly detected malware. The window starts from the first detection until signature generation by anti-malware vendors. The larger the window is, the more time the malicious apps are given to spread over the users’ devices. Current state-of-the-art techniques have a large analysis window due to the significant number of Android malware appearing daily. Besides, these techniques use manual analysis in some cases to investigate malware. Therefore, decreasing the need for manual detection could significantly reduce the analysis window. To address the aforementioned issue, we elaborate systematic techniques and tools for the detection of both known family apps and new malware family apps (i.e., variants of existing families or unseen malware). To do so, we rely on the assumption that any pair of Android apps, with distinct authors and certificates, are most likely to be malicious if they are highly similar, because the adversary usually repackages multiple app packages with the same malicious payload to hide it from anti-malware and vetting systems. Consequently, it is difficult to separate such malicious payloads from the benign functionalities of a given Android package. Accordingly, a pair of Android apps should not be very similar in their components, excluding popular libraries. This observation could be used to design and develop a security framework to detect Android malware apps. In this chapter, we propose a novel Android app fingerprinting technique, APK-DNA, inspired by fuzzy hashing. We specifically target fingerprinting Android malicious apps. Computing the APK-DNA of a suspicious app requires a low computation time.
Afterward, we leverage the previously mentioned assumption (i.e., very similar apps might be malware from the same malware family) to propose a cyber-security framework, namely Cypider (Cyber-Spider for Android malware detection), to detect and cluster Android malware without prior knowledge of Android malware apps. Cypider consists of a novel combination of a set of techniques to address the problem of Android malware clustering and fingerprinting. First, Cypider can detect repackaged malware (malware families), which constitute the vast majority of Android malware apps (Zhou and Jiang (Dissecting android malware: Characterization and evolution, in IEEE Symposium on Security and Privacy, SP 2012, 21–23 May 2012, San Francisco (2012), pp. 95–109)). Second, it can detect new malware apps, and more importantly, Cypider performs the detection automatically and in an unsupervised way (i.e., no prior knowledge about the apps). The fundamental idea of Cypider relies on building a similarity network between the static content of the targeted apps in terms of fuzzy fingerprints. Cypider then extracts, from this similarity network, sub-graphs with high connectivity, called communities, which are most likely to be malicious communities.
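The similarity-network idea described above can be sketched as follows. This is an illustration under strong simplifying assumptions: character n-gram sets with Jaccard similarity stand in for fuzzy fingerprints, and connected components stand in for community detection. None of the names, the threshold, or the toy app contents come from the chapter.

```python
from itertools import combinations
from typing import Dict, List, Set

def ngram_fingerprint(content: str, n: int = 3) -> Set[str]:
    """Stand-in fingerprint: the set of character n-grams of the app content."""
    return {content[i:i + n] for i in range(len(content) - n + 1)}

def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_communities(apps: Dict[str, str], threshold: float = 0.5) -> List[Set[str]]:
    """Link apps whose fingerprints are similar, then return the groups of
    linked apps (connected components with more than one member)."""
    fingerprints = {name: ngram_fingerprint(content) for name, content in apps.items()}
    neighbours: Dict[str, Set[str]] = {name: set() for name in apps}
    for a, b in combinations(apps, 2):
        if jaccard(fingerprints[a], fingerprints[b]) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen: Set[str] = set()
    communities: List[Set[str]] = []
    for start in apps:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:  # depth-first traversal of the similarity network
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) > 1:
            communities.append(component)
    return communities

# Toy "static content" strings for three hypothetical apps.
apps = {
    "app_a": "sendSms;readContacts;uploadData",
    "app_b": "sendSms;readContacts;uploadData2",
    "app_c": "drawCanvas;playSound",
}
communities = similarity_communities(apps)
```

The approach needs no labels: apps are grouped purely because their fingerprints are close, which is what makes the detection unsupervised.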
- Conference Article
13
- 10.1109/ssic.2018.8556755
- Oct 1, 2018
Android malware detection has become increasingly important over the past few years, due to the popularity of Android devices and the explosive growth of Android applications. This calls for more effective techniques to detect Android malware. Some works in the literature show that opcode sequences have a remarkable effect on Android malware detection. However, they omit the information contained in operand sequences. In this paper, we analyse not the opcode sequences but the API calls used in operand sequences, and abstract the API calls to their package names with the aim of being resilient to API changes across different Android API levels. To avoid being computationally expensive, we rely only on n-gram analysis. In addition, we apply the package-level information extracted from API calls to build an Android malware prediction model. We perform experiments on 5560 malware samples from the Drebin dataset, 361 malware samples collected from Contagio Mobile Malware, and 5900 benign Android applications retrieved from Google Play. Results show that the accuracy of our approach exceeds that of opcode n-grams in some respects.
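The abstraction step described above, from API calls to package names followed by n-gram counting, can be sketched as below. The sketch assumes fully qualified call names that follow the Java naming convention (lowercase package components, capitalised class names). The helper names and the example call sequence are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import List, Tuple

def to_package(api_call: str) -> str:
    """Keep the leading lowercase components of a fully qualified call, so that
    'android.telephony.SmsManager.sendTextMessage' becomes 'android.telephony'."""
    parts = []
    for component in api_call.split("."):
        if not component or not component[0].islower():
            break  # first capitalised component is assumed to be the class name
        parts.append(component)
    return ".".join(parts)

def package_ngrams(api_calls: List[str], n: int = 2) -> Counter:
    """Count n-grams over the sequence of package names."""
    packages = [to_package(call) for call in api_calls]
    grams: List[Tuple[str, ...]] = [
        tuple(packages[i:i + n]) for i in range(len(packages) - n + 1)
    ]
    return Counter(grams)

# Hypothetical API call sequence recovered from an app.
calls = [
    "android.telephony.TelephonyManager.getDeviceId",
    "java.net.URL.openConnection",
    "android.telephony.SmsManager.sendTextMessage",
]
bigrams = package_ngrams(calls, n=2)
```

Working at package level means that renamed or newly added methods inside the same package map to the same token, which is the source of the resilience to API-level changes.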
- Research Article
3
- 10.14569/ijacsa.2017.080411
- Jan 1, 2017
- International Journal of Advanced Computer Science and Applications
The complexity and the number of mobile malware samples are increasing continually as the usage of smartphones continues to rise. The popularity of Android has increased the amount of malware that targets Android-based smartphones. Developing efficient and effective approaches for Android malware classification is emerging as a new challenge. This paper introduces an effective Android malware classifier based on the weighted bipartite graph. This classifier includes two phases. In the first phase, the permissions and API calls used in the Android app are used to construct the weighted bipartite graph; the feature importance scores are integrated as weights in the bipartite graph to improve the discrimination between malware and goodware apps by incorporating extra meaningful information into the graph structure. The second phase applies multiple classifiers to categorise the Android application as malware or goodware. The results on an Android malware dataset consisting of different malware families show the effectiveness of our approach to Android malware classification.
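The first phase described above can be pictured with a small data-structure sketch: apps on one side, features (permissions and API calls) on the other, and each edge weighted by the importance score of its feature. The importance values, app names, and helper functions below are hypothetical, and how the paper derives its scores or feeds the graph to classifiers is not specified here.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (app, feature)

def build_bipartite_graph(app_features: Dict[str, Set[str]],
                          importance: Dict[str, float]) -> Dict[Edge, float]:
    """Create one weighted edge per (app, feature) pair; features without a
    known importance score get weight 0.0."""
    edges: Dict[Edge, float] = {}
    for app, features in app_features.items():
        for feature in features:
            edges[(app, feature)] = importance.get(feature, 0.0)
    return edges

def weighted_degree(edges: Dict[Edge, float], app: str) -> float:
    """Sum of edge weights attached to an app node, one simple graph-derived signal."""
    return sum(weight for (a, _), weight in edges.items() if a == app)

# Hypothetical importance scores and extracted features.
importance = {"SEND_SMS": 0.5, "INTERNET": 0.25}
app_features = {
    "app1": {"SEND_SMS", "INTERNET"},
    "app2": {"INTERNET"},
}
graph = build_bipartite_graph(app_features, importance)
```

Weighting the edges lets a rarely benign feature such as an SMS-sending permission contribute more to an app's graph profile than a ubiquitous one.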
- Conference Article
13
- 10.1109/pst.2017.00036
- Jun 1, 2017
In this paper we propose a heuristic approach to static analysis of Android applications based on matching suspicious applications with the predefined malware models. Static models are built from Android capabilities and Android Framework API call chains used by the application. All of the analysis steps and model construction are fully automated. Therefore, the method can be easily deployed as one of the automated checks provided by mobile application marketplaces or other interested organizations. Using the proposed method we analyzed the Drebin and ISCX malware collections in order to find possible relationships and dependencies between samples in collections, and a large fraction of Google Play apps collected between 2013 and 2016 representing benign data. Analysis results show that a combination of relatively simple static features represented by permissions and API call chains is enough to perform binary classification between malware and benign apps, and even find the corresponding malware family, with an appropriate false positive rate of about 3% (less than 1% in case of filtering adware). Malware collections exploration results show that Android malware rarely uses obfuscation or encryption techniques to make static analysis more difficult, which is quite the opposite of what we see in the case of the 'Wintel' endpoint platform family. We also provide the experiment-based comparison with the previously proposed state-of-the-art Android malware detection method adagio.
- Research Article
6
- 10.3390/informatics10030067
- Aug 18, 2023
- Informatics
There are a variety of reasons why smartphones have grown so pervasive in our daily lives. While their benefits are undeniable, Android users must be vigilant against malicious apps. The goal of this study was to develop a broad framework, named DroidMDetection, for detecting Android malware using multiple deep learning classifiers. To provide precise, dynamic Android malware detection and clustering of different families of malware, the framework makes use of unique methodologies based on deep learning and natural language processing (NLP) techniques. Compared with similar works, DroidMDetection (1) uses API calls and intents in addition to the common permissions to accomplish broad malware analysis, (2) uses feature digests generated by a deep auto-encoder to cluster the detected malware samples into malware family groups, and (3) benefits from both feature extraction and feature selection. Numerous reference datasets were used to conduct in-depth analyses of the framework. DroidMDetection’s detection rate was high, and the created clusters were relatively consistent, regardless of the evaluation parameters. DroidMDetection surpasses the state-of-the-art solutions MaMaDroid, DroidMalwareDetector, MalDozer, and DroidAPIMiner across all metrics we used to measure their effectiveness.
- Research Article
238
- 10.1109/tifs.2018.2806891
- Aug 1, 2018
- IEEE Transactions on Information Forensics and Security
The rapid increase in the number of Android malware poses great challenges to anti-malware systems, because the sheer number of malware samples overwhelms malware analysis systems. The classification of malware samples into families, such that the common features shared by malware samples in the same family can be exploited in malware detection and inspection, is a promising approach for accelerating malware analysis. Furthermore, the selection of representative malware samples in each family can drastically decrease the number of malware to be analyzed. However, the existing classification solutions are limited because of the following reasons. First, the legitimate part of the malware may misguide the classification algorithms because the majority of Android malware are constructed by inserting malicious components into popular apps. Second, the polymorphic variants of Android malware can evade detection by employing transformation attacks. In this paper, we propose a novel approach that constructs frequent subgraphs (fregraphs) to represent the common behaviors of malware samples that belong to the same family. Moreover, we propose and develop FalDroid, a novel system that automatically classifies Android malware and selects representative malware samples in accordance with fregraphs. We apply it to 8407 malware samples from 36 families. Experimental results show that FalDroid can correctly classify 94.2% of malware samples into their families using approximately 4.6 sec per app. FalDroid can also dramatically reduce the cost of malware investigation by selecting only 8.5% to 22% representative samples that exhibit the most common malicious behavior among all samples.