Abstract

Binary code similarity detection (BCSD) plays a crucial role in various computer security applications, including vulnerability detection, malware detection, and software component analysis. With the development of the Internet of Things (IoT), there are many binaries from different instruction architecture sets, which require BCSD approaches robust against different architectures. In this study, we propose a novel IoT-oriented binary code similarity detection approach. Our approach leverages a customized transformer-based language model with disentangled attention to capture relative position information. To mitigate out-of-vocabulary (OOV) challenges in the language model, we introduce a base-token prediction pre-training task aimed at capturing basic semantics for unseen tokens. During function embedding generation, we integrate directed jumps, data dependency, and address adjacency to capture multiple block relations. We then assign different weights to different relations and use multi-layer Graph Convolutional Networks (GCN) to generate function embeddings. We implemented the prototype of IoTSim. Our experimental results show that our proposed block relation matrix improves IoTSim with large margins. With a pool size of 103, IoTSim achieves a recall@1 of 0.903 across architectures, outperforming the state-of-the-art approaches Trex, SAFE, and PalmTree.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.