Topic modeling in software engineering research

Camila Costa Silva,Matthias Galster,Fabian Gilson

doi:10.1007/s10664-021-10026-0

Journal: Empirical Software Engineering
Publication Date: Sep 6, 2021
Citations: 46
License type: open-access

Affiliation: University of Canterbury

Abstract

Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims to describe how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modeled most, (3) data pre-processing and modeling parameters vary considerably and are often vaguely reported, and (4) manual topic naming (such as deducing names based on frequent words in a topic) is common.

Highlights

Text mining is about searching, extracting and processing text to provide meaningful insights from the text based on a certain goal.
Considering the limitations of topic modeling techniques and topic models on the one hand and their potential usefulness in software engineering on the other hand, our goal is to describe how topic modeling has been applied in software engineering research.
We identify characteristics and limitations in the use of topic models and discuss (a) the appropriateness of topic modeling techniques, (b) the importance of pre-processing, (c) challenges related to defining meaningful topics, and (d) the importance of context when manually naming topics.

Summary

Introduction

Text mining is about searching, extracting and processing text to provide meaningful insights from the text based on a certain goal. Text mining has been widely used in software engineering research (Bi et al 2018), for example, to uncover architectural design decisions in developer communication (Soliman et al 2016) or to link software artifacts to source code (Asuncion et al 2010). An advantage of topic modeling over other text mining techniques is that it helps analyze long texts (Treude and Wagner 2019; Miner et al 2012), creates clusters as “topics” (rather than individual words) and is unsupervised (Miner et al 2012). Topic modeling has become popular in software engineering research (Sun et al 2016; Chen et al 2016). Sun et al (2016) found that topic modeling had been used to support source code comprehension, feature location and defect prediction. Chen et al (2016) found that many repository mining studies apply topic modeling to textual data such as source code and log messages to recommend code refactoring (Bavota et al 2014b) or to localize bugs (Lukins et al 2010).
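To make the idea of unsupervised word clusters concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the procedure of any study discussed here. The four example texts, the stop-word list, the function names, and the parameter values (two topics, alpha = 0.1, beta = 0.01, 200 iterations) are all assumptions chosen for illustration. The sketch covers three steps: pre-processing the text, fitting the model, and listing the most frequent words per topic, which a person could then use to name each topic manually.

```python
import random
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real studies use larger lists.
STOP_WORDS = {"the", "a", "an", "to", "in", "of", "and", "is", "on",
              "when", "for", "with", "how"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return up to n most frequent words per topic (candidates for naming)."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


def doc_topic_shares(doc_topic, alpha=0.1):
    """Convert per-document topic counts into smoothed proportions."""
    shares = []
    for counts in doc_topic:
        total = sum(counts) + alpha * len(counts)
        shares.append([(c + alpha) / total for c in counts])
    return shares


# Made-up texts resembling bug reports and developer questions.
raw_docs = [
    "Null pointer exception when parsing the config file",
    "Crash on startup: exception in config parser",
    "How to write unit tests for the parser",
    "Unit tests fail with assertion error on CI",
]
docs = [preprocess(text) for text in raw_docs]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
topics = top_words(topic_word)
shares = doc_topic_shares(doc_topic)
for k, words in enumerate(topics):
    print(f"topic {k}: {', '.join(words)}")
```

On a corpus this small the resulting clusters are unstable and depend on the seed, the number of topics, and the pre-processing choices, which mirrors the observation in the abstract that pre-processing and modeling parameters need to be chosen and reported carefully.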

Objectives

Methods

Conclusion