Enhanced Feature Selection Using Word Embeddings for Self-Admitted Technical Debt Identification

Jernej Flisar, Vili Podgorelec

doi:10.1109/seaa.2018.00045

Abstract

Technical debt (TD) is a term used to describe a trade-off between code quality and timely software release. Since technical debt has a negative impact on software development, identifying such debt is an important task in the software engineering domain. Sometimes, technical debt is annotated in source code comments. This kind of debt is referred to as self-admitted technical debt (SATD). Recently, some studies have focused on automated detection and classification of SATD using natural language processing methods. However, these methods have used only manually annotated data to train their classifiers. In this paper, we present the results of an exploratory study on using a large corpus of unlabeled code comments, extracted from open-source projects on GitHub, to train word embeddings in order to improve the detection of SATD. Our approach aims to enhance the feature selection method by taking advantage of the pre-trained word embeddings to detect similar features in source code comments. The experimental results show a significant improvement in SATD classification. Achieving 82% correct predictions of SATD, the method appears to be a good candidate for adoption in practice.
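
The idea of using embeddings to detect similar features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, threshold, or embedding model, so the cosine similarity, the `threshold` value, the `expand_features` helper, and the hand-made three-dimensional vectors below are all assumptions chosen purely for illustration. In a real setting the vectors would come from a model trained on the unlabeled comment corpus.

```python
import math

# Toy, hand-made vectors standing in for word embeddings trained on
# unlabeled code comments (hypothetical values, for illustration only).
EMBEDDINGS = {
    "hack":       [0.90, 0.10, 0.00],
    "workaround": [0.80, 0.20, 0.10],
    "kludge":     [0.85, 0.15, 0.05],
    "returns":    [0.00, 0.90, 0.30],
    "license":    [0.10, 0.00, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_features(selected, embeddings, threshold=0.95):
    """Return the selected features plus every vocabulary word whose
    embedding is at least `threshold`-similar to one of them.
    (Assumed expansion rule; the paper may use a different one.)"""
    expanded = set(selected)
    for seed in selected:
        if seed not in embeddings:
            continue  # no vector for this feature, nothing to compare
        for word, vec in embeddings.items():
            if cosine(embeddings[seed], vec) >= threshold:
                expanded.add(word)
    return expanded

# Start from one feature picked on the labeled data and widen it
# with its nearest neighbours in the embedding space.
features = expand_features({"hack"}, EMBEDDINGS)
print(sorted(features))
```

With these toy vectors, the seed feature "hack" pulls in "workaround" and "kludge", while unrelated words such as "returns" and "license" stay out. The point of the sketch is only that words never seen in the labeled data can still become features if they lie close to a selected feature in the embedding space.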

Full Text