Comparing techniques for authorship attribution of source code

Steven Burrows,Andrew Turpin,Alexandra L Uitdenbogerd

doi:10.1002/spe.2146

Abstract

Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-grams) orsoftware metrics; and the classification technique that exploits those features, either information retrieval ranking or machine learning. The results of existing studies, however, are not directly comparable as all use different test beds and evaluation methodologies, making it difficult to assess which approach is superior. This paper summarises all previous techniques to source code authorship attribution, implements feature sets that are motivated by the literature, and applies information retrieval ranking methods or machine classifiers for each approach. Importantly, all approaches are tested on identical collections from varying programming languages and author types. Our conclusions are as follows: (i) ranking and machine classifier approaches are around 90% and 85% accurate, respectively, for a one-in-10 classification problem; (ii) the byte-level n-gram approach is best used with different parameters to those previously published; (iii) neural networks and support vector machines were found to be the most accurate machine classifiers of the eight evaluated; (iv) use of n-gram features in combination with machine classifiers shows promise, but there are scalability problems that still must be overcome; and (v) approaches based on information retrieval techniques are currently more accurate than approaches based on machine learning.

Full Text