GSEDroid: GNN-based Android malware detection framework using lightweight semantic embedding

Jintao Gu, Hongliang Zhu, Zewei Han, Xiangyu Li, Jianjin Zhao

doi:10.1016/j.cose.2024.103807

Abstract

Android malware remains highly prevalent. Malicious programs increasingly use advanced obfuscation techniques, which improve their disguise, multiply their variants, and make detection harder for security professionals. Leveraging semantic features is a promising way to address these challenges. The rich semantic information in opcodes and API call graphs has been identified as crucial for distinguishing benign from malicious applications. Accordingly, various Natural Language Processing (NLP) techniques, such as Word2vec, have been used to encode Dalvik opcode sequences into embedded representations. Because malware developers often choose semantically similar APIs to achieve comparable functionality, the opcode embeddings of such APIs should be similar. However, simple NLP models that extract only statistical information do not provide comprehensive semantic insight and are therefore insufficient for understanding the behavioral patterns of obfuscated malware. To bridge this gap, we propose a novel, lightweight embedding model based on CodeBERT and TextCNN that aims to represent opcode sequences efficiently and precisely. Building on this model, we introduce GSEDroid, an Android malware detection framework that characterizes APKs using an API call graph with permission and opcode semantic features. This approach converts the detection problem into a graph classification task solved with a graph neural network. We validated the method through comparative analyses with other techniques. Experimental results show that our GraphSage+SAGPooling model achieved an accuracy of 99.47% and an F1-score of 99.44%, underscoring its effectiveness in Android malware detection.
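
To make the graph classification idea in the abstract concrete, the following is a minimal, dependency-free sketch of one GraphSAGE-style layer (mean aggregator), a simplified self-attention pooling step in the spirit of SAGPooling, and a mean readout. It is not the authors' implementation: the function names, the toy call graph, the two-dimensional node features, and the hand-picked weights are all assumptions made for illustration, and a real model would learn its weights and stack several such layers.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mean_vectors(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sage_layer(features, neighbors, w_self, w_neigh):
    """One GraphSAGE-style layer: ReLU(W_self h_v + W_neigh mean(h_u))."""
    dim = len(next(iter(features.values())))
    out = {}
    for node, h in features.items():
        agg = mean_vectors([features[n] for n in neighbors.get(node, [])], dim)
        z = [a + b for a, b in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        out[node] = [max(0.0, v) for v in z]
    return out

def sag_pool(features, neighbors, score_weights, ratio=0.5):
    """Simplified self-attention pooling (assumption: mean-aggregated scoring).

    Scores each node from its own and its neighbors' features, keeps the
    top `ratio` fraction of nodes, and gates kept features by their score.
    """
    dim = len(next(iter(features.values())))
    scores = {}
    for node, h in features.items():
        agg = mean_vectors([features[n] for n in neighbors.get(node, [])], dim)
        raw = sum(w * (a + b) for w, a, b in zip(score_weights, h, agg))
        scores[node] = math.tanh(raw)
    keep_count = max(1, math.ceil(ratio * len(features)))
    kept = sorted(scores, key=scores.get, reverse=True)[:keep_count]
    pooled = {n: [scores[n] * v for v in features[n]] for n in kept}
    # Drop edges that point at nodes removed by pooling.
    pooled_neighbors = {n: [m for m in neighbors.get(n, []) if m in pooled]
                        for n in kept}
    return pooled, pooled_neighbors

def readout(features):
    """Mean readout: collapse node features into one graph-level vector."""
    dim = len(next(iter(features.values())))
    return mean_vectors(list(features.values()), dim)

# Hypothetical toy call graph: four API nodes with 2-dimensional features
# standing in for permission and opcode-embedding features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [0.5, 0.5]}
neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}
w_self = [[1.0, 0.0], [0.0, 1.0]]    # hand-picked, not learned
w_neigh = [[0.5, 0.0], [0.0, 0.5]]   # hand-picked, not learned

hidden = sage_layer(features, neighbors, w_self, w_neigh)
pooled, pooled_neighbors = sag_pool(hidden, neighbors, [1.0, 1.0], ratio=0.5)
graph_vector = readout(pooled)
```

In a full pipeline, `graph_vector` would be passed to a classifier that labels the APK as benign or malicious. Libraries such as PyTorch Geometric provide trainable versions of these building blocks.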

Full Text