Abstract

Authorship attribution is needed in diverse applications, ranging from historical texts (Shakespeare's works, the Federalist Papers), studied for historical interest, to recent novels, studied for linguistic research or simply out of curiosity (Robert Galbraith, alias J. K. Rowling). Extensive research on this problem has produced effective general-purpose methods. The original author also needs to be identified for other types of text. In particular, we are interested in methods to identify JavaScript programmers, which can be used to reveal the offender who produced malicious software on a website. So far, this little-studied problem has been approached mainly with general-purpose methods from natural language authorship attribution. Moreover, no suitable reference dataset is available for method evaluation and method development in a supervised machine learning approach. In this work, we first obtain a reference dataset of substantial size and quality. We then propose to extract structural features from the Abstract Syntax Tree (AST) to describe the coding style of an author. In the experiments, we show that these specifically designed features indeed improve the attribution of scripting code to programmers, especially when used in addition to character n-gram features.
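
To make the two feature families named above concrete, the following is a minimal sketch and not the feature set used in this work. It assumes trigrams for the character n-grams, and it assumes that node-type counts, parent>child type pairs, and tree depth are reasonable examples of structural AST features. The function names, the `ng:` and `ast:` key prefixes, and the hand-written toy tree are hypothetical. The tree uses a simplified `children` list rather than the named fields a real JavaScript parser would emit.

```python
from collections import Counter


def char_ngrams(source: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a source string."""
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))


def ast_features(node: dict) -> Counter:
    """Count node types and parent>child type pairs in a toy AST.

    Assumption: each node is a dict with a "type" key and an optional
    "children" list. Real parser output is structured differently.
    """
    feats = Counter()
    stack = [node]
    while stack:
        cur = stack.pop()
        feats[cur["type"]] += 1
        for child in cur.get("children", []):
            feats[cur["type"] + ">" + child["type"]] += 1
            stack.append(child)
    return feats


def ast_depth(node: dict) -> int:
    """Return the number of nodes on the longest root-to-leaf path."""
    return 1 + max((ast_depth(c) for c in node.get("children", [])), default=0)


def combined_features(source: str, tree: dict) -> Counter:
    """Merge both feature families, prefixing keys so they cannot collide."""
    feats = Counter()
    for gram, count in char_ngrams(source).items():
        feats["ng:" + gram] = count
    for name, count in ast_features(tree).items():
        feats["ast:" + name] = count
    feats["ast:depth"] = ast_depth(tree)
    return feats


# Hand-written toy tree for the snippet `var x = 1;` (illustrative only).
toy_source = "var x = 1;"
toy_ast = {"type": "Program", "children": [
    {"type": "VariableDeclaration", "children": [
        {"type": "VariableDeclarator", "children": [
            {"type": "Identifier"},
            {"type": "Literal"},
        ]},
    ]},
]}

features = combined_features(toy_source, toy_ast)
```

A feature dictionary of this kind could be passed to any standard supervised classifier. Keeping the two families under separate prefixes makes it easy to compare n-grams alone against n-grams plus structural features.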
