A Transformer-based approach to highly granular source code authorship attribution

Chongzheng Shi

doi:10.54254/2755-2721/78/20240686

Abstract

Traditional source code authorship identification methods often rely on features such as textual similarity, programming style, or metadata. However, these methods often struggle to capture an author's precise coding style when dealing with large-scale code bases or complex programming patterns, resulting in poor performance. This paper therefore proposes a Transformer-based, highly fine-grained source code authorship attribution method. To address the coarse-grained word segmentation used in the existing literature, it proposes a highly fine-grained source code segmentation method that extracts finer-grained features. To address redundancy in the feature dimensions, it adopts a Transformer network to locate the sensitive features that characterise an author's style. To verify its effectiveness, the model was tested on the GCJ-C++ and GCJ-Java datasets. The experimental results show that the proposed method achieves higher recognition accuracy on the source code authorship attribution problem than traditional methods.

Full Text