A Deep Dive Inside DREBIN: An Explorative Analysis beyond Android Malware Detection Scores
Machine learning advances have been extensively explored for implementing large-scale malware detection. When reported in the literature, performance evaluation of machine learning based detectors generally focuses on highlighting the ratio of samples that are correctly or incorrectly classified, overlooking essential questions on why/how the learned models can be demonstrated as reliable. In the Android ecosystem, several recent studies have highlighted how evaluation setups can carry biases related to datasets or evaluation methodologies. Nevertheless, there is little work attempting to dissect the produced model to provide some understanding of its intrinsic characteristics. In this work, we fill this gap by performing a comprehensive analysis of a state-of-the-art Android malware detector, namely DREBIN, which constitutes today a key reference in the literature. Our study mainly targets an in-depth understanding of the classifier characteristics in terms of (1) which features actually matter among the hundreds of thousands that DREBIN extracts, (2) whether the high scores of the classifier are dependent on the dataset age, and (3) whether DREBIN’s explanations are consistent within malware families, among others. Overall, our tentative analysis provides insights into the discriminatory power of the feature set used by DREBIN to detect malware. We expect our findings to bring about a systematisation of knowledge for the community.
- Research Article
55
- 10.1109/tcyb.2022.3164625
- Jan 1, 2023
- IEEE Transactions on Cybernetics
Evolving Android malware poses a severe security threat to mobile users, and machine-learning (ML)-based defense techniques attract active research. Because classifiers lack prior knowledge of them, malware from many zero-day families may remain undetected until the classifier gains specialized knowledge. Most existing ML-based methods take a long time to learn new malware families in the latest malware family landscape. Existing ML-based Android malware detection and classification methods struggle with the fast evolution of the malware landscape, particularly in terms of the emergence of zero-day malware families and the limited representation of single-view features. In this article, a new multiview feature intelligence (MFI) framework is developed to learn the representation of a targeted capability from known malware families for recognizing unknown and evolving malware with the same capability. The new framework performs reverse engineering to extract multiview heterogeneous features, including semantic string features, API call graph features, and smali opcode sequential features. It can learn the representation of a targeted capability from known malware families through a series of processes of feature analysis, selection, aggregation, and encoding, to detect unknown Android malware with a shared target capability. We create a new dataset with ground-truth information regarding capability. Many experiments are conducted on the new dataset to evaluate the performance and effectiveness of the new method. The results demonstrate that the new method outperforms three state-of-the-art methods, including: 1) Drebin; 2) MaMaDroid; and 3) N-opcode, when detecting unknown Android malware with targeted capabilities.
- Research Article
30
- 10.1109/tifs.2021.3080510
- Jan 1, 2021
- IEEE Transactions on Information Forensics and Security
This paper presents a signal processing and machine learning (ML) based methodology to leverage Electromagnetic (EM) emissions from an embedded device to remotely detect a malicious application running on the device and classify the application into a malware family. We develop Fast Fourier Transform (FFT) based feature extraction followed by Support Vector Machine (SVM) and Random Forest (RF) based ML models to detect malware. We further propose methods to learn the characteristic behavior of different malware from EM traces to reveal similarities to known malware families and improve the efficiency of malware analysis. We propose to use Discrete Wavelet Transform (DWT) based feature extraction from spectrograms of EM side-channel traces and perform ML on the extracted features to learn fine-grained patterns of malware families. The experimental demonstration on the Open-Q 820 development platform shows a 0.99 F1 score in detecting malware and a 0.88 F1 score in uniquely classifying malware among 8 malware families, evaluated using Support Vector Machine (SVM) and Random Forest (RF) ML models. We also demonstrate the capability of the proposed framework in identifying new unknown applications with 0.99 recall and unknown malware families with 0.87 recall.
- Research Article
50
- 10.1016/j.cels.2020.10.007
- Nov 18, 2020
- Cell systems
Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning.
- Research Article
46
- 10.1007/s11704-017-6493-y
- Jun 30, 2018
- Frontiers of Computer Science
The domination of the Android operating system in the market share of smart terminals has engendered increasing threats of malicious applications (apps). Research on Android malware detection has received considerable attention in academia and the industry. In particular, studies on malware families have been beneficial to malware detection and behavior analysis. However, identifying the characteristics of malware families and the features that can describe a particular family have been less frequently discussed in existing work. In this paper, we are motivated to explore the key features that can classify and describe the behaviors of Android malware families to enable fingerprinting the malware families with these features. We present a framework for signature-based key feature construction. In addition, we propose a frequency-based feature elimination algorithm to select the key features. Finally, we construct the fingerprints of ten malware families, including twenty key features in three categories. Results of extensive experiments using Support Vector Machine demonstrate that the malware family classification achieves an accuracy of 92% to 99%. The typical behaviors of malware families are analyzed based on the selected key features. The results demonstrate the feasibility and effectiveness of the presented algorithm and fingerprinting method.
- Research Article
- 10.47065/josh.v6i1.6053
- Oct 21, 2024
- Journal of Information System Research (JOSH)
Malware poses a significant threat to cybersecurity, particularly for Android users. Malware is grouped into distinct categories and families, each exhibiting unique malicious capabilities. Accurately identifying these categories and families is crucial for developing effective prevention and mitigation strategies, allowing threats to be controlled before they worsen. Over the years, numerous techniques have been proposed for detecting malware families, with system calls emerging as a vital feature. Collected through dynamic analysis, system calls offer in-depth insights into the activities executed by malware, making them a powerful classification tool. This study aims to enhance the detection of Android malware families and categories by analyzing system calls with a feature selection method. Using the Gain Ratio algorithm, significant system calls are identified to improve detection accuracy and reduce the complexity of the feature set. The study assesses machine learning algorithms, particularly Random Forest, J48, Naïve Bayes, and Decision Table. The findings show that Random Forest consistently outperforms the other algorithms, achieving an accuracy of 88.01% for malware family detection and 89.65% for category detection, with high precision and recall across most metrics. The application of the Gain Ratio feature selection method led to a 68.83% feature reduction and improved model-building speed by 50.26%. This integration of feature selection and machine learning provides a more effective approach to detecting malware families and categories, thus contributing to enhanced Android security.
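To make the Gain Ratio criterion mentioned above concrete, here is a minimal sketch of how it can be computed for one discrete feature. It uses the standard definition (information gain divided by split information); the function names and the toy data are illustrative assumptions, not the study's implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of one discrete feature with respect to the class labels."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after it.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information: entropy of the feature's own value distribution.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): whether a given system call was observed per sample.
labels = ["malware", "malware", "benign", "benign"]
informative = [1, 1, 0, 0]    # perfectly separates the classes
uninformative = [1, 0, 1, 0]  # carries no class information
```

Ranking features by this score and keeping only the top ones is one way such a selection step could reduce the feature set before training a classifier.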
- Conference Article
2
- 10.1109/icssa45270.2018.00023
- Jul 1, 2018
Malicious software (malware) in the Android ecosystem is a critical issue. Manual detection of malware is not cost-effective and cannot keep up with the fast evolution of malware development in Android. Machine learning based malware detection attempts to automate the detection of malware in Android. In this paper, we present new Android malware detection methods. The main idea of our proposed approach is to apply three different feature selection methods before a malware detection model is constructed with a machine learning algorithm. We used both the Malware Genome Project dataset and our own crawled dataset to show the effectiveness of the proposed methods.
- Research Article
- 10.52783/jisem.v10i15s.2495
- Mar 4, 2025
- Journal of Information Systems Engineering and Management
Introduction: Mobile security suffers greatly from the rapid spread of Android malware, leading to the need for sophisticated detection methods that can adapt to new threats. Malware is still a difficult security issue in the Android ecosystem because it frequently obfuscates itself to avoid detection. Semantic behavior feature extraction is essential in this situation in order to build a reliable malware detection model. Objectives: To provide an overview of Android malware, including the impact of malware detection, the significance of malware detection, its types, and a framework for detecting Android malware that uses Transfer Learning (TL) to predict malware in the Android ecosystem and AutoEncoders (AEs) to extract features. Methods: This study presents an integrated deep learning (DL) method for Android malware detection that combines AEs for feature extraction and TL for classification. In order to effectively represent both benign and malicious behavior, AEs are used to extract latent, high-dimensional features from both static and dynamic analysis data. Then, TL uses pre-trained deep neural networks to classify Android apps more accurately and with less training time. Results: The tests were conducted on three datasets with two labels in the "class" attribute, "0" for benign and "1" for malicious, in order to evaluate the effectiveness of the suggested framework. The framework attains 99.99% accuracy, with an MAE of 0.001 and an RMSE of 0.063. The findings show that the proposed model achieved remarkable accuracy and has the potential to produce reliable malware detection results. Conclusions: An integrated DL strategy for Android malware detection that combines AEs for feature extraction and TL for classification has been presented. AEs are used to extract high-dimensional, latent characteristics from both static (code-related) and dynamic (behavior-related) Android app analysis data. The system may thus effectively capture and reflect both benign and harmful behaviors.
- Conference Article
7
- 10.1109/ntms.2014.6814026
- Mar 1, 2014
The rapid emergence of cyber-threats strains today's cyberspace, which needs practical and efficient capabilities for malware traffic detection. In this paper, we propose an extension to an initial research effort, namely, towards fingerprinting malicious traffic by putting an emphasis on the attribution of maliciousness to malware families. The technique proposed in the previous work establishes a synergy between automatic dynamic analysis of malware and machine learning to fingerprint badness in network traffic. Machine learning algorithms are used with features that exploit only high-level properties of traffic packets (e.g. packet headers). Besides the detection of malicious packets, we want to enhance the fingerprinting capability with the identification of the malware families responsible for generating the malicious packets. The identification of the underlying malware family is derived from a sequence of application protocols, which is used as a signature for the family in question. Furthermore, our results show that our technique achieves a promising malware family identification rate with low false positives.
- Conference Article
29
- 10.1145/3022227.3022306
- Jan 5, 2017
Malware damages computers and the threat is a serious problem. Malware can be detected by pattern matching or by dynamic heuristic methods. However, it is difficult to detect all new malware subspecies perfectly with existing methods. In this paper, we propose a new method which automatically detects new malware subspecies by static analysis of executable files and machine learning. The method can distinguish malware from benignware and it can also classify malware subspecies into malware families. We combine static analysis of executable files with a machine learning classifier and machine-learning-based natural language processing. DLL import information, assembly code and hexdumps are acquired by static analysis of malware and benignware executable files to create feature vectors. Paragraph vectors of the statically extracted information are created with the PV-DBOW model for natural language processing. A support vector machine and a k-nearest neighbor classifier are used in our method, and the classifiers learn the paragraph vectors of the statically extracted information. Unknown executable files are classified as malware or benignware by the pre-learned SVM. Moreover, malware subspecies are classified into malware families by the pre-learned k-nearest neighbor classifier. We evaluate the accuracy of the classification by experiments. We think that new malware subspecies can be effectively detected by our method without existing methods for malware analysis such as the generic method and the dynamic heuristic method.
- Research Article
20
- 10.1109/tc.2022.3143439
- Nov 1, 2022
- IEEE Transactions on Computers
Android malware is an ongoing threat to billions of smart devices’ security, ranging from mobile phones to car infotainment systems. Despite numerous approaches and previous studies to develop solutions for detecting and preventing Android malware, the rapid continuous development of new malware variants requires a careful reconsideration and the development of effective methods to identify malware families given a meager number of malware instances. In this paper, we present DroidMalVet, a novel Android malware family classification and detection approach that does not require performing complex program analyses or utilizing large feature sets. DroidMalVet is the first to use a promising, diverse, and small set of software metrics as features in a supervised learning platform to classify and detect various Android malware families. Our extensive empirical evaluations on two large public malware datasets show that DroidMalVet accurately detects both small and large malware families with F-Scores of 94.4% and 96%, and AUC equal to 99.5% and 99.7%, on the malware families in the Drebin and AMD datasets, respectively. Moreover, our results demonstrate the superior performance of DroidMalVet in detecting small families (i.e., families with few samples). DroidMalVet complements existing approaches and presents an early warning tool for detecting known and emerging malware families.
- Conference Article
18
- 10.23919/softcom.2018.8555738
- Sep 1, 2018
Malicious software, also known as "malware", is software that uses legitimate instructions or code to perform malicious actions. Malware poses a major threat to computer security and information security in general. Over the years, malware has evolved to the point that a single malware specimen can have hundreds or maybe thousands of variants, using polymorphic and metamorphic transformation to change the signature of the malware variant in propagation. The common signature-based malware detection methods are no longer robust enough to detect these variants due to the alteration of code. Static analysis is required to obtain these signatures and anti-virus companies are required to propagate these signature updates to their software. A faster detection method is needed to compensate for the exponentially increasing number of malware variants. Machine learning is a trending approach for malware detection. This removes the need to use signature-based detection and is also faster. Software companies do not need to propagate signatures as often. Machine learning algorithms using opcode sequences can recognise patterns in the malicious code that are not present in common signatures and classify them more efficiently. Therefore, a machine learning approach for malware detection should be adopted for faster and more efficient detection. Most research in malware detection using machine learning used static attributes such as network connections, processes spawned, hashes, etc., that were not that robust to changes. In this paper we introduce our novel approach of using trigrams and PE file attributes as features for malware detection. We took a text mining approach to make our detection method more robust to polymorphism and metamorphism. The instruction sequence for critical code in malware on the assembly level is basically the same across malware families. We used opcode trigram sequences as the main feature for our machine learning algorithm. We used a Support Vector Machine (SVM) as our classifying algorithm, which is a discriminative classifier model that gives a definite decision on whether the predicted outcome belongs to the learned class or not. This novel approach enabled us to get higher detection rates with fewer features.
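As a rough illustration of the opcode trigram features described above, the sketch below counts every run of three consecutive opcodes and maps a sequence onto a fixed trigram vocabulary. The opcode sequence, the vocabulary, and the function names are hypothetical; the paper's actual extraction pipeline and PE attributes are not reproduced here.

```python
from collections import Counter


def opcode_trigrams(opcodes):
    """Count every run of three consecutive opcodes in a sequence."""
    return Counter(zip(opcodes, opcodes[1:], opcodes[2:]))


def to_feature_vector(opcodes, vocabulary):
    """Map an opcode sequence to trigram counts over a fixed vocabulary."""
    counts = opcode_trigrams(opcodes)
    return [counts[trigram] for trigram in vocabulary]


# Hypothetical disassembly output and trigram vocabulary.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]
vocabulary = [
    ("push", "mov", "call"),
    ("mov", "call", "ret"),
    ("xor", "xor", "xor"),
]
vector = to_feature_vector(sample, vocabulary)  # [2, 1, 0]
```

Vectors of this kind could then be passed to a classifier such as an SVM, with the vocabulary fixed from the training set so that every sample has the same dimensionality.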
- Conference Article
5
- 10.1109/sin54109.2021.9699322
- Dec 15, 2021
With the increased popularity and wide adoption of Android as a mobile OS platform, it has been a major target for malware authors. Due to unprecedented rapid growth in the number, variants, and diversity of malware, detecting malware on the Android platform has become challenging. Beyond detecting malware, classifying the family it belongs to helps security analysts reuse malware removal techniques that are known to work for that family. Manual analysis is needed if the malware belongs to an unknown family. Therefore, classifying malware into its exact family is important. This paper presents a technique and tool named MAPFam that applies machine learning to static features from the Manifest file and API packages to classify Android malware into its family. This work is premised on a starting hypothesis that features extracted from API packages rather than from API calls lead to more precise classification. Our experiments indeed show that API package based models provide ∼1.63X more accurate classification compared to an API call based method. Our machine learning based malware family classification system uses API packages, requested permissions, and other features from the Manifest files. The proposed family classification system achieves accuracy and average precision above 97% for the top 60 malware families by using only 81 features, with a 97.55% model reliability rate (Kappa score). The experimental results also show that MAPFam can perfectly identify 36 malware families.
- Conference Article
2
- 10.1109/icsecs52883.2021.00107
- Aug 1, 2021
‘Malicious software’ or malware has been a serious threat to the security and privacy of all mobile phone users. The popularity of smartphones, primarily Android, makes them a very viable target for spreading malware. In the past, many solutions have proved ineffective and have resulted in many false positives. Having the ability to identify and classify malware will help prevent it from spreading and evolving. In this paper, we study the effectiveness of the proposed malware family classification using pixel-level features. This study has implemented well-known machine learning and deep learning classifiers such as K-Nearest Neighbours (k-NN), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree, and Random Forest. Binary files from 25 malware families are converted into fixed-size grayscale images. Each 100x100 grayscale image is then transformed into a single row of 100000 columns. During this phase, none of the columns are removed, so as to retain the patterns in each malware family. The experimental results show that our approach achieved 92% accuracy with Random Forest, 88% with SVM, 81% with Decision Tree, 80% with k-NN and 56% with the Naïve Bayes classifier. Overall, the pixel-based features are also a promising technique for identifying the family of malware with high accuracy, especially using the Random Forest classifier.
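The conversion from a binary file to a flat pixel-feature row can be sketched as follows. This is only an illustration under stated assumptions: each byte is treated as one grayscale pixel, and files are truncated or zero-padded to fit a fixed `side` x `side` image; the abstract does not say how the study resizes its images, and the function name is hypothetical. In this sketch a 100x100 image flattens to 10,000 values.

```python
def bytes_to_pixel_features(data, side=100):
    """Turn raw bytes into a flat list of side*side grayscale values (0-255)."""
    size = side * side
    # Each byte becomes one grayscale pixel; long files are truncated (assumption).
    pixels = list(data[:size])
    # Short files are zero-padded so every sample has the same length (assumption).
    pixels.extend([0] * (size - len(pixels)))
    return pixels


# Hypothetical four-byte "binary" used only to show the shape of the output.
features = bytes_to_pixel_features(b"\x4d\x5a\x90\x00")
```

Every column is kept, which mirrors the abstract's point that no pixel columns are removed before classification.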
- Research Article
- 10.3390/electronics11244148
- Dec 12, 2022
- Electronics
Low-resource malware families are highly susceptible to being overlooked when using machine learning or deep learning models for automated detection because of the small number of data samples. When we aim to train a classifier for a low-resource malware family, the training data from the family itself is not sufficient to train a good classifier. In this work, we study the relationship between different malware families and improve the performance of machine learning based malware detection models on low-resource malware families. First, we propose an empirical supportive score to measure the transfer quality and find that transfer performance varies a lot between different malware families. Second, we propose a Sequential Family Selection (SFS) algorithm to select multiple families as the training data. With SFS, we only transfer knowledge from several supportive families to target low-resource families. We conduct experiments on 16 families and 4 malware detection models; the results show that our model outperforms the best baselines by 2.29% on average and that our algorithm achieves up to a 14.16% improvement in accuracy. Third, we study the transferred knowledge and find that our algorithm, by using the proposed supportive score, can capture the common characteristics between different malware families and achieve good detection performance on the low-resource malware family. Our algorithm could also be applicable to image detection and signal detection.
- Dissertation
- 10.31979/etd.q2kz-y82c
- Feb 24, 2021
A fundamental problem in malware research consists of malware detection, that is, distinguishing malware samples from benign samples. This problem becomes more challenging when we consider multiple malware families. A typical approach to this multi-family detection problem is to train a machine learning model for each malware family and score each sample against all models. The resulting scores are then used for classification. We refer to this approach as “cold fusion,” since we combine previously-trained models; no retraining of these base models is required when additional malware families are considered. An alternative approach is to train a single model on samples from multiple malware families. We refer to this latter approach as “hot fusion,” since we must completely retrain the model whenever an additional family is included in our training set. In this research, we compare hot fusion and cold fusion, in terms of both accuracy and efficiency, as a function of the number of malware families considered. We use features based on opcodes and a variety of machine learning techniques.
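The cold fusion idea described above can be sketched in a few lines: each family has its own previously built scoring model, a sample is scored against all of them, and the scores drive the decision. Everything here is a toy under stated assumptions: `make_toy_model` is a hypothetical stand-in for a trained per-family model, the family names and opcode sets are invented, and picking the highest score is just one simple way to use the scores for classification.

```python
def make_toy_model(signature_opcodes):
    """Hypothetical stand-in for a trained per-family model.

    Scores a sample by the fraction of its opcodes found in a signature set.
    """
    signature = set(signature_opcodes)

    def score(opcodes):
        return sum(op in signature for op in opcodes) / len(opcodes)

    return score


def cold_fusion_classify(sample, family_models):
    """Score a sample against every per-family model and pick the best family."""
    scores = {family: model(sample) for family, model in family_models.items()}
    return max(scores, key=scores.get), scores


family_models = {
    "family_a": make_toy_model(["xor", "jmp"]),
    "family_b": make_toy_model(["call", "push"]),
}
sample = ["push", "call", "call", "xor"]
label, scores = cold_fusion_classify(sample, family_models)

# Adding a family only adds one model; the existing models are left untouched.
family_models["family_c"] = make_toy_model(["push", "call", "xor"])
new_label, new_scores = cold_fusion_classify(sample, family_models)
```

Hot fusion, by contrast, would correspond to fitting one multi-class model on the pooled samples and refitting it from scratch whenever a family is added.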