Identification Author of Source Code by Machine Learning Methods

Alexander Romanov, Anna Kurtukova

doi:10.15622/sp.2019.18.3.741-765

Abstract

The paper addresses the problem of identifying the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property. It presents a detailed analysis of modern solutions to the problem. The authors propose two new identification techniques based on machine learning algorithms: one built on a support vector machine, a fast correlation filter, and informative features; the other built on a hybrid convolutional recurrent neural network. The experimental dataset includes source code samples written in Java, C++, Python, PHP, JavaScript, C, C#, and Ruby. The data was obtained from GitHub, a web service for hosting IT projects. The total number of source code samples exceeds 150 thousand, with an average length of 850 characters each. The corpus comprises 542 authors. The experiments were conducted with source code written in the most popular programming languages. The accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with 2 to 50 authors for Java, the most popular programming language. Graphs of identification accuracy against corpus size are plotted. The analysis of the results showed that the technique based on the hybrid neural network achieves 97% accuracy, which is currently the best known result. The technique based on the support vector machine achieved 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.
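The 10-fold cross-validation protocol mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the character-frequency features and the `NearestCentroid` classifier are placeholder assumptions standing in for the paper's informative features, SVM, and neural network, and all function names here are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign every sample index to one of k folds, spreading each author evenly."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for idx, author in enumerate(labels):
        by_author[author].append(idx)
    folds = [[] for _ in range(k)]
    for author in sorted(by_author):
        indices = by_author[author]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def char_profile(code):
    """Relative character frequencies: a toy stand-in for real stylometric features."""
    counts = Counter(code)
    total = sum(counts.values()) or 1
    return {ch: n / total for ch, n in counts.items()}


class NearestCentroid:
    """Placeholder classifier (assumption; NOT the paper's SVM or hybrid network)."""

    def fit(self, samples, labels):
        sums = defaultdict(lambda: defaultdict(float))
        sizes = Counter(labels)
        for code, author in zip(samples, labels):
            for ch, freq in char_profile(code).items():
                sums[author][ch] += freq
        # Average the per-sample profiles of each author.
        self.centroids = {
            a: {ch: v / sizes[a] for ch, v in prof.items()} for a, prof in sums.items()
        }
        return self

    def predict(self, code):
        prof = char_profile(code)

        def dist(centroid):
            keys = set(prof) | set(centroid)
            return sum((prof.get(c, 0.0) - centroid.get(c, 0.0)) ** 2 for c in keys)

        return min(sorted(self.centroids), key=lambda a: dist(self.centroids[a]))


def cross_validate(samples, labels, k=10, make_model=NearestCentroid, seed=0):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    accuracies = []
    for test_idx in stratified_folds(labels, k, seed):
        if not test_idx:
            continue
        held_out = set(test_idx)
        train = [i for i in range(len(samples)) if i not in held_out]
        model = make_model().fit([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(model.predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Any classifier exposing `fit` and `predict` in this shape could be passed as `make_model`, so the same harness would serve for comparing techniques across different numbers of authors.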

Highlights

This problem can be solved by increasing the corpus size for each author
The paper addresses the problem of identifying the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property
The authors propose two new identification techniques based on machine learning algorithms: one built on a support vector machine, a fast correlation filter, and informative features; the other built on a hybrid convolutional recurrent neural network

Summary

Two new identification techniques based on machine learning algorithms are proposed: one built on a support vector machine, a fast correlation filter, and informative features; the other built on a hybrid convolutional recurrent neural network. Techniques for identifying the author of source code make it possible to check student work for plagiarism in programming-related courses. The technique was later improved with calibration curves for analyzing incomplete and non-compilable code samples, which allowed the authors to achieve 73% accuracy with only a single source code sample from an author programming in C++ [6]. Article [8] proposes a deep-learning-based source code author identification system (DL-CAIS) that performs identification regardless of programming language and obfuscation. The works reviewed lead to the conclusion that various machine learning (ML) methods are unquestionably effective for the task of source code author identification. A generalized technique for identifying the author of source code based on the SVM model considered is shown in Figure 1.
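The feature selection step can be illustrated with a short sketch. It assumes that the "fast correlation filter" refers to the Fast Correlation-Based Filter (FCBF), which ranks features by symmetrical uncertainty with the class and discards features that are redundant with a stronger one; the source does not spell out the algorithm, so this reading is an assumption. The sketch also assumes features have already been discretized, and the names `fcbf` and `symmetrical_uncertainty` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = entropy(list(zip(x, y)))
    info_gain = hx + hy - joint  # mutual information I(X; Y)
    return 2.0 * info_gain / (hx + hy)


def fcbf(features, target, threshold=0.0):
    """Select feature names from a dict of name -> list of discrete values.

    A feature is kept if its SU with the target exceeds the threshold and no
    already-kept (more relevant) feature is at least as correlated with it
    as it is with the target.
    """
    relevance = {name: symmetrical_uncertainty(col, target) for name, col in features.items()}
    ranked = sorted(
        (name for name in relevance if relevance[name] > threshold),
        key=lambda name: relevance[name],
        reverse=True,
    )
    selected = []
    for name in ranked:
        redundant = any(
            symmetrical_uncertainty(features[kept], features[name]) >= relevance[name]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected
```

In an authorship setting the columns would be discretized code metrics and the target would be the author label; the selected subset would then be passed to the classifier.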

[Figure residue. Recoverable labels: "Software system for author identification"; "Average number of lines in methods"; "Tensor of the anonymous source" (truncated); axis "Number of authors in the set"; legend: HNN, SVM.]

Findings

Static and dynamic metrics

Journal: Труды СПИИРАН	Publication Date: Jun 4, 2019
Citations: 5	License type: CC BY 4.0
