Abstract

Information Retrieval (IR) plays a key role in diverse Software Engineering (SE) tasks. The similarity metric is the core component of any IR technique, and its performance varies across document types. Different SE tasks operate on different types of documents, such as bug reports, software descriptions, and source code, which often contain non-standard, domain-specific vocabulary. Thus, it is important to understand which similarity metrics are suitable for which SE documents. We analyze the performance of different similarity metrics on various SE documents, covering textual artifacts (e.g., descriptions, READMEs), code artifacts (e.g., source code, APIs, import packages), and mixed text-and-code artifacts (e.g., bug reports). We observe that, in general, context-aware IR models achieve better performance on textual artifacts, whereas simple keyword-based bag-of-words models perform better on code artifacts.
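
To make the contrast concrete, the following minimal sketch (assuming Python with scikit-learn; the documents, query, and metric choice are hypothetical illustrations, not the metrics evaluated in the study) shows a keyword-based bag-of-words similarity of the kind described above: documents and a query are represented as TF-IDF vectors and ranked by cosine similarity.

    # Illustrative sketch: ranking SE documents against a query with a
    # keyword-based bag-of-words metric (TF-IDF + cosine similarity).
    # The example documents below are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "NullPointerException thrown when parsing empty config file",  # bug report
        "import java.util.HashMap; map.put(key, value);",              # code snippet
        "A lightweight library for parsing YAML configuration files",  # description
    ]
    query = "crash while reading configuration"

    # Fit a TF-IDF vocabulary over the corpus and project the query into it.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])

    # Cosine similarity scores each document purely by keyword overlap;
    # it carries no notion of context, unlike embedding-based IR models.
    scores = cosine_similarity(query_vector, doc_vectors).flatten()
    for doc, score in sorted(zip(documents, scores), key=lambda p: -p[1]):
        print(f"{score:.3f}  {doc}")

A context-aware model would instead compare dense embeddings that capture word meaning in context, which is consistent with such models faring better on natural-language artifacts while keyword overlap suffices for code artifacts.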

