Abstract

Binary code similarity is the foundation of many security and software engineering applications. Recent works leverage deep neural networks (DNN) to learn a numeric vector representation (namely, embeddings ) of assembly functions, enabling similarity analysis in the numeric space. However, existing DNN-based techniques capture syntactic-, control flow-, or data flow-level information of assembly code, which is too coarse-grained to represent program functionality. These methods can suffer from low robustness to challenging settings such as compiler optimizations and obfuscations. We present sem2vec , a binary code embedding framework that learns from semantics . Given the control-flow graph (CFG), 34 pages. of an assembly function, we divide it into tracelets , denoting continuous and short execution traces that are reachable from the function entry point. We use symbolic execution to extract symbolic constraints and other auxiliary information on each tracelet. We then train masked language models to compute embeddings of symbolic execution outputs. Last, we use graph neural networks, to aggregate tracelet embeddings into the CFG-level embedding for a function. Our evaluation shows that sem2vec extracts high-quality embedding and is robust against different compilers, optimizations, architectures, and popular obfuscation methods including virtualization obfuscation. We further augment a vulnerability search application with embeddings computed by sem2vec and demonstrate a significant improvement in vulnerability search accuracy.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call