Abstract
Deep learning methods have found successful applications in fields such as image classification and natural language processing. More recently, they have also been applied to source code analysis, owing to the enormous amount of freely available source code (e.g., in open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to source code representation that uses information about the syntactic structure of the code, and we extend it to represent source code changes (i.e., commits). We use this representation to tackle a task of industrial relevance: the classification of security-relevant commits. We leverage transfer learning, a machine learning technique that reuses, or transfers, information learned on previous tasks (commonly called pretext tasks) to tackle a new target task. We assess the impact of using two different pretext tasks, for which abundant labeled data is available, on the classification of security-relevant commits. Our results indicate that representations that exploit the structural information in code syntax outperform token-based representations. Furthermore, we show that pre-training on a small dataset ($$>10^4$$ samples) for a pretext task that is closely related to the target task yields better performance metrics than pre-training on a loosely related pretext task with a very large dataset ($$>10^6$$ samples).