The widespread reuse of open-source code amplifies the impact of vulnerabilities. Current vulnerability detection methods predominantly rely on binary code similarity comparison, which requires disassembling binaries to obtain assembly code or control flow graphs. These methods depend on specific disassembly tools and complex preprocessing, which limits their applicability and detection speed. This paper proposes UniBin, a vulnerability detection method based on a multi-layer Transformer encoder. By training on bidirectional, unidirectional, and sequence-to-sequence language modeling (LM) tasks over both binary and assembly code during the pre-training phase, UniBin learns richer semantic information from binary machine code, enabling efficient similarity comparison without disassembly. We cross-compile 55 widely used open-source C projects to build our datasets. After 52 hours of pre-training and 8 hours of fine-tuning, UniBin reaches an average accuracy of 98.3% in similarity detection across compilation conditions, outperforming the state-of-the-art method. For search tasks across optimization options with a pool size of 1000, Recall@1 shows a relative improvement of 28.2% (from 67.9% to 87.1%). UniBin eliminates the dependency on specific disassembly tools and improves end-to-end binary analysis speed by over 36%. In real-world vulnerability detection tasks, UniBin detects all vulnerable functions with the lowest false positive rate (0.16%).