CoDesc: A Large Code–Description Parallel Dataset

Masum Hasan, Rifat Shahriyar

doi:10.48448/aby3-yx67

Abstract

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from industry and the research community, the task remains difficult because the field lacks large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. Researchers are therefore left to collect new small-scale datasets, which leads to inconsistencies across published works. In this study, we present CoDesc, a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22% and achieves a new state of the art in code summarization. Furthermore, we show CoDesc's effectiveness in a pre-training and fine-tuning setup, opening possibilities for building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at https://github.com/csebuetnlp/CoDesc.
