Impact of structural weighting on a latent Dirichlet allocation–based feature location technique

Brian P Eddy, Nicholas A Kraft, Jeff Gray

doi:10.1002/smr.1892

Abstract

Text retrieval–based feature location techniques (FLTs) use information from the terms present in documents in classes and methods. However, relevant terms originating from certain locations (eg, method names) often comprise only a small part of the entire method lexicon. Feature location techniques should benefit from techniques that make greater use of this information. The primary objective of this study was to investigate how weighting terms from different locations in source code can improve a latent Dirichlet allocation (LDA)‐based FLT. We conducted an empirical study of 4 subject software systems and 372 features. For each subject system, we trained 1024 different LDA models with new weighting schemes applied to leading comments, method names, parameters, body comments, and local variables. We conducted both quantitative and qualitative analyses to identify the effects of using the weighting schemes on the performance of the LDA‐based FLT. We evaluated weighting schemes based on mean reciprocal rank and spread of effectiveness measures. In addition, we conducted a factorial analysis to identify which locations have a main impact on the results of the FLT. We then examined the effects of adding information from class comments, class names, and fields to the top 10 configurations for each system. This resulted in an additional 640 different LDA models for each system. From our results, we identified a significant effect on the performance of the LDA‐based FLT when applying our weighting schemes. Furthermore, we found that adding information from each method's containing class can improve the effectiveness of an LDA‐based FLT. Finally, we identified a set of recommendations for identifying better weighting schemes for LDA.
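
To make the configuration space and the evaluation measure concrete, the following Python sketch enumerates weighting configurations over the 5 method-level locations and computes mean reciprocal rank. It rests on assumptions not stated in the abstract: 4 weight levels per location (which would be consistent with 4^5 = 1024 models), the specific weight values 1, 2, 4, and 8, and weighting implemented as term repetition. All function and variable names are hypothetical and are not taken from the study.

```python
from itertools import product

# The five method-level locations named in the abstract.
LOCATIONS = [
    "leading_comments",
    "method_names",
    "parameters",
    "body_comments",
    "local_variables",
]

# ASSUMPTION: four weight levels per location; the values are illustrative only.
WEIGHTS = [1, 2, 4, 8]


def weighting_configurations(locations, weights):
    """Return every assignment of one weight to each location."""
    return [
        dict(zip(locations, combo))
        for combo in product(weights, repeat=len(locations))
    ]


def weighted_document(terms_by_location, config):
    """Build a bag of terms for one method.

    ASSUMPTION: a weight of w is applied by repeating each term from that
    location w times; locations missing from the config default to weight 1.
    """
    document = []
    for location, terms in terms_by_location.items():
        document.extend(terms * config.get(location, 1))
    return document


def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank (1-based) is the position of the
    first relevant method returned for one feature query."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


configs = weighting_configurations(LOCATIONS, WEIGHTS)
```

Under the same assumption, extending each of the top 10 configurations with 3 class-level locations (class comments, class names, fields) would give 10 × 4^3 = 640 further configurations, which matches the count reported above.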

Full Text