Abstract

Like natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, provided that samples of known authorship are available for a set of candidate authors. Although very promising results have been reported for this task, evaluations of existing approaches have not focused on the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification with skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced from all the available training samples per author), and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle highly skewed training sets, while the instance-based method is a better choice for balanced or slightly skewed training sets.
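To make the distinction between the two paradigms concrete, here is a minimal sketch in Python. It is not the implementation evaluated in the paper: the character n-gram representation, the profile-overlap score, the cosine nearest-neighbour rule, and all function names and parameters (`n`, `top`) are assumptions chosen only for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams in a source string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profiles(samples, n=3, top=50):
    """Profile-based (illustrative): ONE representation per author.

    `samples` maps an author name to a list of source strings. All samples
    of an author are concatenated and the `top` most frequent n-grams form
    that author's profile, so the number of samples does not change the
    number of representations.
    """
    profiles = {}
    for author, texts in samples.items():
        counts = char_ngrams("\n".join(texts), n)
        profiles[author] = {gram for gram, _ in counts.most_common(top)}
    return profiles


def predict_profile(profiles, text, n=3, top=50):
    """Return the author whose profile shares the most n-grams with `text`."""
    query = {gram for gram, _ in char_ngrams(text, n).most_common(top)}
    return max(profiles, key=lambda author: len(profiles[author] & query))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_instances(samples, n=3):
    """Instance-based (illustrative): one representation PER TRAINING SAMPLE."""
    return [
        (author, char_ngrams(text, n))
        for author, texts in samples.items()
        for text in texts
    ]


def predict_instance(instances, text, n=3):
    """Return the author of the most similar training sample (1-nearest neighbour)."""
    query = char_ngrams(text, n)
    return max(instances, key=lambda item: cosine(item[1], query))[0]
```

In a skewed training set, an author with many samples contributes many entries to the list returned by `build_instances` but still only one entry to the dictionary returned by `build_profiles`. This sketch only illustrates that structural difference and makes no claim about which approach performs better.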
