Beyond Literal Meaning: Uncover and Explain Implicit Knowledge in Code Through Wikipedia-Based Concept Linking

Chong Wang, Xiujie Meng, Zhenchang Xing, Xin Peng

doi:10.1109/tse.2023.3250029

Abstract

When reusing or modifying code, developers need to understand the implicit knowledge behind a piece of code in addition to its literal meaning. Such implicit knowledge involves related concepts and their explanations. Uncovering and understanding the implicit knowledge in code is challenging due to the extensive use of abbreviations, scattered expressions of concepts, and ambiguity of concept mentions. In this paper, we propose an automatic approach (called CoLiCo) that can uncover implicit concepts in code and link the uncovered concepts to Wikipedia. Based on a trained identifier embedding model, CoLiCo identifies Wikipedia concepts mentioned in a given code snippet and excerpts a paragraph-level explanation from Wikipedia for each concept. During the process, CoLiCo resolves identifier abbreviation (i.e., concepts mentioned in the form of abbreviations) and identifier aggregation (i.e., concepts mentioned by an aggregation of multiple identifiers) based on identifier embedding and mining of identifier abbreviation/aggregation relations. An experimental study shows that CoLiCo outperforms a general entity linking approach by 38.7% in the correctness of concept linking and identifies 96.7% more correct concept linkings on a dataset of 629 code snippets. The concept linking is significant for program understanding in 54% of the code snippets. Our user study shows that CoLiCo can significantly shorten the time and improve the correctness in code comprehension tasks that intensively involve implicit knowledge.
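To make the two resolution problems concrete, the following toy sketch illustrates identifier abbreviation and identifier aggregation on a handful of identifiers. It is not CoLiCo's implementation: the hand-written lookup tables stand in for the trained identifier embedding model and the mined abbreviation/aggregation relations, and every name in it (`ABBREVIATIONS`, `CONCEPTS`, `split_identifier`, `link_concepts`) is hypothetical.

```python
import re

# Hypothetical abbreviation table; the paper mines such relations instead.
ABBREVIATIONS = {
    "btn": "button",
    "conn": "connection",
    "tcp": "transmission control protocol",
}

# Hypothetical stand-ins for Wikipedia concept titles.
CONCEPTS = {
    "binary search tree",
    "button",
    "hash table",
    "transmission control protocol",
}


def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase tokens."""
    tokens = []
    for part in re.split(r"_+", identifier):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [token.lower() for token in tokens]


def link_concepts(identifiers):
    """Return the concepts whose words are all covered by the identifiers.

    Abbreviation: each token is expanded through ABBREVIATIONS.
    Aggregation: words are pooled across all identifiers, so a concept
    may be mentioned jointly by several of them.
    """
    words = set()
    for identifier in identifiers:
        for token in split_identifier(identifier):
            expanded = ABBREVIATIONS.get(token, token)
            words.update(expanded.split())
    return sorted(c for c in CONCEPTS if set(c.split()) <= words)


if __name__ == "__main__":
    print(link_concepts(["tcpConn", "binary_tree", "searchKey"]))
```

Here `tcpConn` reaches "transmission control protocol" only through abbreviation expansion, while "binary search tree" is recovered only by pooling words from `binary_tree` and `searchKey`. Exact word matching of this kind cannot handle ambiguous mentions, which is one reason an embedding-based approach is needed in practice.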

Full Text