Abstract
The paper analyzes the problem of identifying the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property.
The paper presents a detailed analysis of modern solutions to the problem. The authors propose two new identification techniques based on machine learning algorithms: the first combines a support vector machine, a fast correlation filter, and informative features; the second uses a hybrid convolutional recurrent neural network.
The experimental database includes source code samples written in Java, C++, Python, PHP, JavaScript, C, C#, and Ruby. The data was obtained from GitHub, a web service for hosting IT projects. The total number of source code samples exceeds 150 thousand, with an average length of 850 characters each. The corpus covers 542 authors.
The experiments were conducted with source code written in the most popular programming languages. The accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with 2 to 50 authors for Java, the most popular of these languages. Graphs of identification accuracy as a function of corpus size are plotted. The analysis of the results showed that the technique based on the hybrid neural network reaches 97% accuracy, which is currently the best known result. The technique based on the support vector machine achieved 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.
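The 10-fold cross-validation protocol mentioned above can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the classifier here is a hypothetical stand-in (a cosine-similarity match against per-author character trigram profiles) used only to show how folds are formed and accuracies averaged. The function names, the trigram features, and the `seed` parameter are all assumptions introduced for illustration.

```python
import random
from collections import Counter, defaultdict


def char_ngrams(code, n=3):
    """Count overlapping character n-grams in a source-code string."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(samples, labels, k=10, seed=0):
    """Mean accuracy over k folds for the stand-in profile classifier."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        # Build one n-gram profile per author from the training folds only.
        profiles = defaultdict(Counter)
        for i, sample in enumerate(samples):
            if i not in held:
                profiles[labels[i]].update(char_ngrams(sample))
        correct = 0
        for i in held_out:
            grams = char_ngrams(samples[i])
            # Predict the author whose profile is most similar (assumed rule).
            predicted = max(profiles, key=lambda a: cosine(grams, profiles[a]))
            correct += predicted == labels[i]
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)
```

Each sample is held out exactly once, and the reported figure is the mean of the per-fold accuracies. Swapping the stand-in classifier for a support vector machine or a neural network would leave the fold logic unchanged.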
Highlights
This problem can be solved by increasing the corpus size for each author
The paper analyzes the problem of identifying the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property
The authors propose two new identification techniques based on machine learning algorithms: the first combines a support vector machine, a fast correlation filter, and informative features; the second uses a hybrid convolutional recurrent neural network
Summary
Two new identification techniques based on machine learning algorithms are proposed: one using a support vector machine, a fast correlation filter, and informative features; the other using a hybrid convolutional recurrent neural network. Source code author identification techniques make it possible to check student work for plagiarism in programming-related courses. The technique was later improved with calibration curves for analyzing incomplete and non-compilable code samples, which allowed its authors to reach 73% accuracy with only a single source code sample from an author programming in C++ [6]. The paper [8] proposes a deep learning-based code authorship identification system (DL-CAIS) that identifies authors regardless of programming language and obfuscation. The reviewed works support the conclusion that various machine learning (ML) methods are clearly effective for source code author identification. Figure 1 shows the generalized source code author identification technique based on the SVM model discussed.
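The feature-selection step of the first technique can be sketched as follows, assuming that "fast correlation filter" refers to the Fast Correlation-Based Filter (FCBF), which ranks features by symmetric uncertainty with the class label and discards features that are redundant with a higher-ranked one. The summary does not spell out the variant or the features used, so the feature names, the `threshold` parameter, and the toy data are hypothetical. Features are assumed to be discrete; continuous ones would need discretization first.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0
    mutual_info = h_x + h_y - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (h_x + h_y)


def fcbf(features, labels, threshold=0.0):
    """Return names of features relevant to labels and not mutually redundant."""
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=relevance.get, reverse=True)
    selected = []
    for name in ranked:
        # A feature is redundant if an already selected feature is at least
        # as correlated with it as the feature itself is with the class.
        redundant = any(
            symmetric_uncertainty(features[s], features[name]) >= relevance[name]
            for s in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical toy data: two authors, three discrete stylometric features.
authors = ["a", "a", "b", "b"]
toy_features = {
    "uses_tabs": [0, 0, 1, 1],       # separates the two authors
    "uses_tabs_copy": [0, 0, 1, 1],  # duplicates the first feature
    "line_parity": [0, 1, 0, 1],     # unrelated to authorship
}
print(fcbf(toy_features, authors))
```

In this toy case the duplicated feature is dropped as redundant and the unrelated one is dropped as irrelevant, so only `uses_tabs` would be passed on to the classifier.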