Abstract

Information Retrieval (IR) plays a key role in diverse Software Engineering (SE) tasks. The similarity metric is the core component of any IR technique, and its performance varies across document types. Different SE tasks operate on different types of documents, such as bug reports, software descriptions, and source code, which often contain non-standard, domain-specific vocabulary. Thus, it is important to understand which similarity metrics are suitable for which SE documents. We analyze the performance of different similarity metrics on various SE documents, covering textual artifacts (e.g., descriptions, READMEs), code artifacts (e.g., source code, APIs, import packages), and mixed text-and-code artifacts (e.g., bug reports). We observe that, in general, context-aware IR models achieve better performance on textual artifacts, whereas simple keyword-based bag-of-words models perform better on code artifacts.
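
To make the contrast concrete, the following minimal sketch (assuming Python with scikit-learn; the documents, query, and metric choice are hypothetical illustrations, not the metrics evaluated in the study) shows a keyword-based bag-of-words similarity of the kind described above: documents and a query are represented as TF-IDF vectors and ranked by cosine similarity.

    # Illustrative sketch: ranking SE documents against a query with a
    # keyword-based bag-of-words metric (TF-IDF + cosine similarity).
    # The example documents below are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "NullPointerException thrown when parsing empty config file",  # bug report
        "import java.util.HashMap; map.put(key, value);",              # code snippet
        "A lightweight library for parsing YAML configuration files",  # description
    ]
    query = "crash while reading configuration"

    # Fit a TF-IDF vocabulary over the corpus and project the query into it.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])

    # Cosine similarity scores each document purely by keyword overlap;
    # it carries no notion of context, unlike embedding-based IR models.
    scores = cosine_similarity(query_vector, doc_vectors).flatten()
    for doc, score in sorted(zip(documents, scores), key=lambda p: -p[1]):
        print(f"{score:.3f}  {doc}")

A context-aware model would instead compare dense embeddings that capture word meaning in context, which is consistent with such models faring better on natural-language artifacts while keyword overlap suffices for code artifacts.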

