Sem2vec: Semantics-aware Assembly Tracelet Embedding

Huaijin Wang, Shi Wu, Sen Nie, Qiyi Tang, Shuai Wang, Pingchuan Ma

doi:10.1145/3569933

Abstract

Binary code similarity is the foundation of many security and software engineering applications. Recent works leverage deep neural networks (DNNs) to learn numeric vector representations (namely, embeddings) of assembly functions, enabling similarity analysis in the numeric space. However, existing DNN-based techniques capture syntactic-, control flow-, or data flow-level information of assembly code, which is too coarse-grained to represent program functionality. These methods can suffer from low robustness in challenging settings such as compiler optimizations and obfuscations. We present sem2vec, a binary code embedding framework that learns from semantics. Given the control-flow graph (CFG) of an assembly function, we divide it into tracelets, denoting continuous and short execution traces that are reachable from the function entry point. We use symbolic execution to extract symbolic constraints and other auxiliary information on each tracelet. We then train masked language models to compute embeddings of symbolic execution outputs. Last, we use graph neural networks to aggregate tracelet embeddings into the CFG-level embedding for a function. Our evaluation shows that sem2vec extracts high-quality embeddings and is robust against different compilers, optimizations, architectures, and popular obfuscation methods, including virtualization obfuscation. We further augment a vulnerability search application with embeddings computed by sem2vec and demonstrate a significant improvement in vulnerability search accuracy.
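To make the first step of the pipeline concrete, the following is a minimal sketch of dividing a CFG into short traces reachable from the entry point. It is an illustration only: the abstract does not state the exact partitioning rule sem2vec uses, so the bound `max_len`, the names `reachable` and `tracelets`, and the toy CFG layout are all assumptions, and the later stages (symbolic execution, masked language models, graph neural networks) are not shown.

```python
from typing import Dict, List, Tuple

# Toy CFG representation (hypothetical): basic-block id -> successor ids.
CFG = Dict[str, List[str]]


def reachable(cfg: CFG, entry: str) -> List[str]:
    """Return the blocks reachable from `entry`, in depth-first discovery order."""
    seen, stack, order = {entry}, [entry], []
    while stack:
        block = stack.pop()
        order.append(block)
        for succ in cfg.get(block, []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return order


def tracelets(cfg: CFG, entry: str, max_len: int = 3) -> List[Tuple[str, ...]]:
    """Enumerate block paths of at most `max_len` blocks (assumed bound),
    starting at every block reachable from `entry`.

    A path stops when it reaches `max_len` blocks or a block with no
    successors, so loops in the CFG cannot cause unbounded growth.
    """
    result: List[Tuple[str, ...]] = []

    def extend(path: List[str]) -> None:
        succs = cfg.get(path[-1], [])
        if len(path) == max_len or not succs:
            result.append(tuple(path))
            return
        for succ in succs:
            extend(path + [succ])

    for start in reachable(cfg, entry):
        extend([start])
    return result
```

In a fuller pipeline, each returned path would be handed to a symbolic executor, and the resulting constraints would be embedded and aggregated over the CFG; here the paths are simply tuples of block ids, and blocks that cannot be reached from the entry never start a path.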

Journal: ACM Transactions on Software Engineering and Methodology
Publication Date: May 27, 2023
Citations: 3

Similar Papers

Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking
Huaijin Wang ... Zhibo Liu
IEEE Transactions on Software Engineering | VOL. 49
01 Jan 2023

FUSION: Measuring Binary Function Similarity with Code-Specific Embedding and Order-Sensitive GNN
Hao Gao ... Lina Wang
Symmetry | VOL. 14
02 Dec 2022

An Inclusive Report on Robust Malware Detection and Analysis for Cross-Version Binary Code Optimizations
S Poornima, R Mahalakshmi
International Journal on Recent and Innovation Trends in Computing and Communication | VOL. 11
30 Oct 2023

Codeformer: A GNN-Nested Transformer Model for Binary Code Similarity Detection
Guangming Liu ... Feng Yue
Electronics | VOL. 12
04 Apr 2023
