Code comment generation refers to the process of producing concise natural language descriptions for a piece of code, which facilitates program comprehension. Inline comments, a specific form of code comments, are likewise crucial for program comprehension. Recently, the emergence of large language models (LLMs) has significantly boosted performance on natural language processing tasks, which motivates us to explore how LLMs perform on inline code comment generation. To this end, we evaluate open-source LLMs on a large-scale dataset and compare the results with current state-of-the-art methods. Specifically, we measure model performance with widely used evaluation metrics (i.e., BLEU, METEOR, and ROUGE-L) in four scenarios: (1) generation with a simple instruction; (2) few-shot generation guided by random examples selected from the database; (3) few-shot generation guided by similar examples selected from the database; and (4) generation with a re-ranking strategy applied to the LLM outputs. Our findings reveal that: (1) with a simple instruction, LLMs cannot fully demonstrate their potential on inline comment generation and fall short of state-of-the-art models; (2) few-shot prompting with random examples yields only a slight improvement; (3) few-shot prompting with similar examples and the re-ranking strategy both significantly enhance LLM performance; and (4) across comment-code pairs with different intents, the "why" category achieves the best performance while the "what" category performs relatively poorly, a pattern that holds across all four scenarios. Our findings shed light on future research directions for using LLMs in inline comment generation.
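The abstract names similarity-based example retrieval as the most effective few-shot setting but does not specify the retriever. The sketch below is a minimal, hypothetical illustration of that idea: it uses simple token-overlap (Jaccard) similarity as a stand-in retriever to pick the most similar (code, comment) pairs from a database and assemble a few-shot prompt. All function names and the mini-database are assumptions for illustration, not the paper's actual pipeline.

```python
import re
from typing import List, Tuple


def tokenize(code: str) -> set:
    """Crude lexical tokenization: split on non-alphanumeric characters."""
    return {t for t in re.split(r"\W+", code.lower()) if t}


def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity between two code snippets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_similar_examples(query_code: str,
                            database: List[Tuple[str, str]],
                            k: int = 3) -> List[Tuple[str, str]]:
    """Return the k (code, comment) pairs most similar to the query snippet."""
    q = tokenize(query_code)
    ranked = sorted(database,
                    key=lambda pair: jaccard(q, tokenize(pair[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query_code: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt: retrieved examples followed by the query."""
    parts = ["Generate a concise inline comment for the given code line.\n"]
    for code, comment in examples:
        parts.append(f"Code: {code}\nComment: {comment}\n")
    parts.append(f"Code: {query_code}\nComment:")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical mini-database of (code, inline comment) pairs.
    db = [
        ("for i in range(len(items)): total += items[i]",
         "accumulate the sum of all items"),
        ("if not os.path.exists(path): os.makedirs(path)",
         "create the directory if it is missing"),
        ("retries += 1", "increment the retry counter"),
    ]
    query = "count += 1"
    print(build_prompt(query, select_similar_examples(query, db, k=2)))
```

A more realistic setup would replace the Jaccard retriever with lexical (e.g., BM25) or embedding-based similarity, but the prompt-construction flow would look the same.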