Defects in software may result in system crashes, sluggish performance, or even deadlock, leading to the depletion of valuable resources. Implementing defect prediction can assist quality assurance teams in identifying potential software issues and rationalizing the allocation of testing resources, thereby decreasing the elimination of resources and enhancing software sustainability. Researchers have recently incorporated deep learning into defect prediction, extracting structural-semantic features from codes' abstract syntax trees (ASTs). However, inappropriate node granularity in ASTs may adversely impact the effectiveness of the extracted features. In addition, converting AST nodes into integer vectors may lead to the loss of structure information, resulting in poor model predictive capability. This paper proposes a tree-based encoding method with hybrid granularity for defect prediction to address these challenges. Specifically, five granularity selection schemes are extended to generate various ASTs from codes. Subsequently, a tree-based continuous bag-of-words model is utilized to map nodes of ASTs into numeric vector representations that conform to the tree-like structure of codes. The matrices converted from ASTs are then fed into a convolutional neural network to extract program features automatically. Experiments involving 24 versions of open-source projects demonstrate that our method can improve the effectiveness of extracted features in defect prediction tasks.
Read full abstract