Language and Obfuscation Oblivious Source Code Authorship Attribution

Sarim Zafar, Muhammad Zubair Malik, Saeed Salem, Muhammad Usman Sarwar

doi:10.1109/access.2020.3034932


Open Access

https://doi.org/10.1109/access.2020.3034932


Abstract

Source Code Authorship Attribution can answer many interesting questions, such as: Who wrote the malicious source code? Is the source code plagiarized, and does it infringe on copyright? Source Code Authorship Attribution is done by observing distinctive patterns of style in source code whose author is unknown and comparing them with patterns learned from known authors' source code. In this paper, we present an efficient approach to learn a novel representation using deep metric learning. The existing state-of-the-art approaches tokenize the source code and work at the keyword level, limiting the elements of style they can consider. Our approach uses the raw character stream of source code. It can examine keywords as well as other stylistic features, such as variable naming conventions or the use of tabs vs. spaces, enabling us to learn a richer representation than keyword-based approaches. Our approach uses a character-level Convolutional Neural Network (CNN). We train the CNN to map the input character stream to a dense vector, so that source codes written by the same author are mapped close to each other, while source codes written by different programmers are mapped farther apart in the embedding space. We then feed these source code vectors into a K-nearest neighbor (KNN) classifier that uses Manhattan distance to perform authorship attribution. We validated our approach on the Google Code Jam (GCJ) dataset across three different programming languages. We prepare our large-scale dataset in such a way that it does not induce type-I errors. Our approach is more scalable and efficient than existing methods. We achieved an accuracy of 84.94% across 20,458 authors, which is more than twice the scale of any previous study, under a much more challenging setting.
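The last two stages of the pipeline described in the abstract (raw characters in, nearest-neighbor vote with Manhattan distance out) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 128-symbol vocabulary, the padding scheme, and the toy two-dimensional embeddings are all assumptions made for illustration, and the trained CNN that would actually produce the embeddings is omitted.

```python
from collections import Counter


def encode_chars(source, max_len=16, vocab_size=128):
    """Map raw characters to integer ids, keeping whitespace such as tabs.

    Hypothetical encoding: code points are clipped to vocab_size - 1 and the
    sequence is right-padded with 0 to a fixed length, as a CNN input would need.
    """
    ids = [min(ord(ch), vocab_size - 1) for ch in source[:max_len]]
    return ids + [0] * (max_len - len(ids))


def manhattan(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_predict(query, embeddings, authors, k=3):
    """Return the majority author among the k nearest stored embeddings."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: manhattan(query, embeddings[i]))
    votes = Counter(authors[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Tabs and spaces stay distinguishable at the character level.
tab_ids = encode_chars("\tx = 1", max_len=8)
space_ids = encode_chars("    x = 1", max_len=8)

# Toy embeddings standing in for CNN outputs (assumed values).
embeddings = [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]]
authors = ["alice", "alice", "bob", "bob"]
predicted = knn_predict([0.2, 0.2], embeddings, authors, k=3)
```

Because the encoder works on characters rather than tokens, an indentation habit such as a leading tab shows up directly in the input, which is the kind of stylistic signal the abstract says keyword-level tokenization loses.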

Highlights

Source code often contains distinctive patterns that represent a programmer’s style of writing code
Source code authorship attribution has primarily relied on feature engineering, where unique features are associated with each author, such as variable naming conventions or the use of for or while loops
We demonstrate that our proposed framework can identify authors writing in individual programming languages and even the authors who write in multiple programming languages

Summary

Introduction

Source code often contains distinctive patterns that represent a programmer’s style of writing code. Source code authorship attribution aims to extract these patterns from the source code and identify the author. It has primarily relied on feature engineering, where unique features are associated with each author, such as variable naming conventions or the use of for or while loops. Extracting such features is time-consuming and challenging. Source code authorship attribution has numerous applications in the information security domain, such as identifying malicious source code authors, plagiarism detection [1], and resolving copyright infringement [2]. The authorship identification problem largely depends on each author’s unique features, such as variable naming conventions or the use of for or while loops. This study uses representation learning: we train a CNN using the lifted structured loss function to extract meaningful feature vectors.
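To make the training objective more concrete, the following is a minimal sketch of a lifted structured loss over one batch of embeddings. The summary does not state the exact formulation used, so this follows the commonly cited definition: for each same-author pair (i, j), J_ij = log(sum over negatives k of i of exp(margin - D_ik) + sum over negatives l of j of exp(margin - D_jl)) + D_ij, and the loss is the mean of max(0, J_ij)^2 divided by 2. The Euclidean distance, the margin of 1.0, and the function name are assumptions, not details taken from the paper.

```python
import math


def lifted_structured_loss(embeddings, labels, margin=1.0):
    """Lifted structured loss for one batch (illustrative, assumed formulation)."""
    n = len(embeddings)

    def dist(i, j):
        # Euclidean distance between two embedding vectors (assumption).
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(embeddings[i], embeddings[j])))

    def negative_terms(i):
        # exp(margin - D_ik) for every sample k written by a different author.
        return [math.exp(margin - dist(i, k))
                for k in range(n) if labels[k] != labels[i]]

    total, num_pos = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                continue  # only same-author (positive) pairs are summed
            negs = negative_terms(i) + negative_terms(j)
            if not negs:
                continue  # no negatives in the batch: pair contributes nothing
            num_pos += 1
            j_ij = math.log(sum(negs)) + dist(i, j)
            total += max(0.0, j_ij) ** 2
    return total / (2 * num_pos) if num_pos else 0.0


# Toy batch: two tight clusters of points (assumed values).
points = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
loss_separated = lifted_structured_loss(points, ["a", "a", "b", "b"])
loss_mixed = lifted_structured_loss(points, ["a", "b", "a", "b"])
```

When each cluster belongs to a single author the loss is zero, while assigning the same points to interleaved authors yields a positive loss. That gradient signal is what pushes same-author samples together and different-author samples apart in the embedding space.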

Results

Discussion

Conclusion