Abstract

As a fundamental component of software documentation, code comments help developers comprehend and maintain programs. Previous studies have proposed several datasets of method header comments for machine learning-based code comment generation. As part of code comments, inline code comments are also crucial for code understanding activities. However, unlike method header comments, which follow a standard format and describe an entire method, inline comments are often written in arbitrary formats due to deadline pressures and describe different aspects of the code snippets within a method. Currently, no large-scale dataset for inline comment generation accounts for these characteristics. This naturally inspires us to explore whether we can construct a dataset that fosters machine learning research by both performing fine-grained noise cleaning and providing a taxonomy of inline comments. To this end, we first collect inline comments and code snippets from 8,000 Java projects on GitHub. We then conduct a manual review to derive heuristic rules, which can be used to clean data noise in a fine-grained manner. As a result, we construct a large-scale benchmark dataset named ICG with 5,740,770 pairs of inline comments and code snippets. We then build a comprehensive taxonomy and perform statistical and manual analyses to explore the characteristics of different categories of inline comments, such as their helpfulness for code understanding. Finally, we provide and compare several baseline models for automatic inline comment generation, such as CodeBERT, to enhance the usability of the benchmark for researchers. The availability of our benchmark and baselines can help develop and validate new inline comment generation methods, which would further facilitate code understanding activities.
