Development of a technique for identifying the authorship of binary and disassembled program code based on an ensemble of modern natural language processing methods

Anna V. Kurtukova, Alexandr A. Shelupanov, Aleksandr S. Romanov, Tomsk State University of Control Systems and Radioelectronics (TUSUR)

doi:10.21293/1818-0442-2023-26-4-53-60

Abstract

This article is part of a series of studies on identifying the authorship of source code. The analysis of binary or disassembled code is a critical task in information security, software development, and computer forensics, driven by the need to protect intellectual property and copyright and to identify the authors of malware. Any program is ultimately machine code that can be disassembled (converted into assembly-language text) with specialized tools and then analyzed for authorship in the same way as natural-language text. To solve this problem, the article proposes a technique based on an ensemble of fastText, a support vector machine (SVM), and a hybrid neural network developed by the authors. The technique was evaluated on C and C++ source code collected from the GitHub and Google Code Jam platforms, compiled into executable files, and disassembled with reverse-engineering tools. The average accuracy of identifying the author of disassembled code with the proposed technique exceeded 0.9. The technique was also tested on source code, where it reached an average accuracy of 0.96 in simple cases and more than 0.85 in complex cases (obfuscation, coding standards, etc.).
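The abstract does not state how the outputs of the three ensemble members are combined. As an illustration only, the sketch below assumes soft voting: each model returns a probability for every candidate author, the probabilities are averaged (optionally with weights), and the author with the highest combined score is selected. The function names, author labels, and probability values are hypothetical and are not taken from the article.

```python
from typing import List, Optional, Sequence


def soft_vote(model_probs: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Return the weighted average of per-model probability vectors.

    Assumption: every model scores the same candidate authors in the
    same order. The combination rule itself is an assumption, not the
    article's documented method.
    """
    if not model_probs:
        raise ValueError("need at least one model output")
    n_classes = len(model_probs[0])
    if any(len(p) != n_classes for p in model_probs):
        raise ValueError("all models must score the same set of authors")
    if weights is None:
        weights = [1.0] * len(model_probs)  # equal weight per model
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    return [
        sum(w * p[i] for w, p in zip(weights, model_probs)) / total
        for i in range(n_classes)
    ]


def predict_author(model_probs: Sequence[Sequence[float]],
                   authors: Sequence[str],
                   weights: Optional[Sequence[float]] = None) -> str:
    """Pick the author with the highest combined probability."""
    combined = soft_vote(model_probs, weights)
    best = max(range(len(authors)), key=lambda i: combined[i])
    return authors[best]


# Hypothetical per-model outputs for one disassembled sample.
authors = ["author_a", "author_b", "author_c"]
fasttext_probs = [0.5, 0.25, 0.25]
svm_probs = [0.25, 0.5, 0.25]
hybrid_nn_probs = [0.75, 0.125, 0.125]

all_probs = [fasttext_probs, svm_probs, hybrid_nn_probs]
combined = soft_vote(all_probs)
predicted = predict_author(all_probs, authors)
print(predicted, combined)
```

Averaging probabilities rather than counting hard votes lets a confident model outweigh two uncertain ones; unequal `weights` could be used if one member is known to be more reliable on a validation set.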

Full Text