Abstract
Deep learning methods have found successful applications in fields such as image classification and natural language processing. More recently, they have also been applied to source code analysis, owing to the enormous amount of freely available source code (e.g., in open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to source code representation that uses information about the syntactic structure of the code, and we extend it to represent source code changes (i.e., commits). We use this representation to tackle a task of industrial relevance: the classification of security-relevant commits. We leverage transfer learning, a machine learning technique that reuses, or transfers, information learned on previous tasks (commonly called pretext tasks) to tackle a new target task. We assess the impact of using two different pretext tasks, for which abundant labeled data is available, on the classification of security-relevant commits. Our results indicate that representations that exploit the structural information in code syntax outperform token-based representations. Furthermore, we show that pre-training on a small dataset ($$>10^4$$ samples) for a pretext task that is closely related to the target task yields better performance metrics than pre-training on a loosely related pretext task with a very large dataset ($$>10^6$$ samples).