Automated Commit Intelligence by Pre-training

Shangqing Liu, Guozhu Meng, Yang Liu, Yanzhou Li, Xiaofei Xie, Wei Ma

doi:10.1145/3674731

Abstract

GitHub commits, which pair code changes with natural-language messages describing them, play a critical role in helping developers understand software evolution. Because of this importance, several learning-based approaches target commit-related tasks such as commit message generation and security patch identification. However, most existing work customizes a specialized neural network for each task. Inspired by code pre-trained models, which have proven effective across a range of downstream tasks, and to support the open-source software community, we first collect a large-scale commit benchmark of over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six tasks spanning three categories of objectives (denoising, cross-modal generation, and contrastive learning) to learn representations of commit fragments. We evaluate the model on one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks show that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task improves model performance.
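To make the denoising category more concrete, the following is a minimal sketch of how a BART-style text-infilling example could be built from a commit. It is an illustration under stated assumptions, not the paper's actual preprocessing: the `[MSG]` and `[DIFF]` segment markers, the `<mask>` token, whitespace tokenization, and the helper names `build_commit_tokens` and `infill_span` are all hypothetical.

```python
from typing import List, Tuple

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def build_commit_tokens(message: str, diff: str) -> List[str]:
    """Flatten a commit into one whitespace-tokenized sequence.

    The [MSG] and [DIFF] segment markers are illustrative assumptions.
    """
    return ["[MSG]", *message.split(), "[DIFF]", *diff.split()]


def infill_span(
    tokens: List[str], start: int, length: int
) -> Tuple[List[str], List[str]]:
    """Replace tokens[start:start + length] with a single mask token.

    Returns the corrupted sequence (encoder input) and a copy of the
    original sequence (the target the decoder learns to reconstruct).
    """
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("span out of range")
    corrupted = tokens[:start] + [MASK] + tokens[start + length:]
    return corrupted, list(tokens)


# Mask the whole commit message so it must be recovered from the diff.
tokens = build_commit_tokens("fix null check", "- if (x) + if (x != null)")
corrupted, target = infill_span(tokens, start=1, length=3)
```

In a real pipeline the span position and length would be sampled randomly and a subword tokenizer would replace `str.split`; the fixed span here only keeps the sketch deterministic.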

More From: ACM Transactions on Software Engineering and Methodology
