Abstract

Latent Dirichlet allocation (LDA) has seen increasing use in the understanding of source code and its related artifacts in part because of its impressive modeling power. However, this expressive power comes at a cost: The technique includes several tuning parameters whose impact on the resulting LDA model must be carefully considered. The aim of this work is to provide insights into the tuning parameters' impact. Doing so improves the comprehension of both researchers who look to exploit the power of LDA in their research and those who interpret the output of LDA-using tools. It is important to recognize that the goal of this work is not to establish values for the tuning parameters because there is no universal best setting. Rather, appropriate settings depend on the problem being solved, the input corpus (in this case, typically words from the source code and its supporting artifacts), and the needs of the engineer performing the analysis. This work's primary goal is to aid software engineers in their understanding of the LDA tuning parameters by demonstrating numerically and graphically the relationship between the tuning parameters and the LDA output. A secondary goal is to enable more informed setting of the parameters. Copyright © 2016 John Wiley & Sons, Ltd.
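
For readers unfamiliar with which knobs "tuning parameters" refers to, the following is a minimal sketch using scikit-learn's LatentDirichletAllocation as a stand-in implementation (the abstract does not name a specific toolkit, and the toy corpus of source-code words is purely illustrative). The parameter values shown are arbitrary placeholders, not recommendations; the paper's point is precisely that no universal best setting exists.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: identifier and comment words extracted from source files.
documents = [
    "parse token lexer grammar syntax",
    "socket connect send receive timeout",
    "parse grammar ast node visitor",
]

vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(
    n_components=2,         # number of topics (commonly K)
    doc_topic_prior=0.1,    # alpha: document-topic Dirichlet prior
    topic_word_prior=0.01,  # beta (eta): topic-word Dirichlet prior
    max_iter=50,            # number of training iterations
    random_state=0,
)

doc_topics = lda.fit_transform(doc_term_matrix)  # per-document topic proportions
print(doc_topics)
print(lda.components_)  # per-topic word weights
```

Changing any of these values (the number of topics, the Dirichlet priors, or the iteration count) alters the fitted topics and the per-document topic mixtures, which is the relationship the paper examines numerically and graphically.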
