Abstract

Code authorship attribution aims to identify the author of software source code based on the author's unique coding style characteristics. The lack of benchmark data in the field has forced researchers to employ various resources that often did not reflect real programming practices. Over the years, research studies have used textbook examples, students' programming assignments, faculty code samples, code from programming competitions, and files retrieved from open-source repositories as research objects. The diversity of this data raised concerns about whether it captures the characteristics needed to reliably evaluate code attribution. In this paper, we investigate these concerns and analyze the effect of dataset characteristics and feature elimination techniques on the accuracy of code attribution. Unlike the majority of work in this field, which mainly concentrates on designing new features, we explore the nature of the data used in previous studies and assess the factors that influence the attribution task. Within this analysis, we investigate the robustness of three feature sets regarded as reliable benchmarks in attribution research. Based on our findings, we define a process for deriving a reduced set of features for accurate and predictable attribution and make recommendations on dataset characteristics.
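To make the feature-elimination idea concrete, the sketch below shows one common way such a reduction could be performed: training a classifier on stylometric code features and pruning them with cross-validated recursive feature elimination. This is an illustrative assumption, not the paper's actual pipeline; the feature matrix, author labels, and classifier choice here are hypothetical placeholders.

```python
# Minimal sketch: attributing authorship from stylometric code features and
# pruning the feature set with recursive feature elimination (RFECV).
# All data below is synthetic and stands in for real extracted features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # 200 code samples x 50 stylometric features
y = rng.integers(0, 10, size=200)  # author labels for 10 hypothetical authors

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated recursive feature elimination keeps only the features
# that contribute to attribution accuracy, yielding a reduced feature set.
selector = RFECV(clf, step=5, cv=5).fit(X, y)
X_reduced = selector.transform(X)

print("features kept:", selector.n_features_)
print("attribution accuracy (reduced set):",
      cross_val_score(clf, X_reduced, y, cv=5).mean())
```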
