Improvements to code2vec: Generating path vectors using RNN

Xuekai Sun, Chunling Liu, Weiyu Dong, Tieming Liu

doi:10.1016/j.cose.2023.103322

Abstract

Source code analysis has many applications, such as code plagiarism detection and software vulnerability search. It can benefit from machine learning, but machine learning models typically require a standard vector representation and cannot be applied directly to source code. Source code must therefore be embedded into a vector representation that preserves as much of its semantics as possible. Code2vec proposes a code embedding method that converts source code into a code vector through its Abstract Syntax Tree (AST). However, we found that code2vec uses a hashing algorithm to generate the identifier for the path in each path context, which loses the node information in the path and makes the number of model training parameters very large. We therefore present a new path representation that uses an RNN to generate vectors for paths. We also propose alternative model designs and evaluate their impact in experiments. The results obtained on a challenging source code classification task suggest that, compared to code2vec, the RNN-based path representation produces a better embedding model with fewer training parameters.
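
The contrast between the two path representations can be illustrated with a minimal sketch. It is not the implementation from the paper: the node-type names, the embedding size `DIM`, the random weights, and the choice of a plain tanh recurrent cell are all assumptions made for illustration, since the abstract does not specify the cell type. The helper names (`rnn_path_vector`, `hashed_params`, `rnn_params`) are hypothetical.

```python
import math
import random

DIM = 4  # hypothetical embedding size, kept small for illustration

# Two AST paths written as sequences of node types (hypothetical examples).
paths = [
    ("NameExpr", "AssignExpr", "MethodCallExpr", "NameExpr"),
    ("NameExpr", "AssignExpr", "BinaryExpr", "NameExpr"),
]

rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# (a) Hash-based paths: the whole path collapses to one opaque identifier,
# and each identifier owns its own row in an embedding table. The two paths
# above share three of four nodes, but their rows are unrelated.
hashed_table = {hash(p): rand_vec(DIM) for p in paths}

# (b) RNN-based paths: only node types get embeddings; the path vector is
# the final hidden state after reading the nodes in order.
node_types = sorted({n for p in paths for n in p})
node_emb = {n: rand_vec(DIM) for n in node_types}
w_x = rand_mat(DIM, DIM)  # input-to-hidden weights (assumed plain RNN cell)
w_h = rand_mat(DIM, DIM)  # hidden-to-hidden weights
b = rand_vec(DIM)         # bias

def rnn_path_vector(path, node_emb, w_x, w_h, b):
    h = [0.0] * DIM
    for node in path:
        zx = matvec(w_x, node_emb[node])
        zh = matvec(w_h, h)
        h = [math.tanh(x + y + z) for x, y, z in zip(zx, zh, b)]
    return h

def hashed_params(n_paths, dim):
    # One embedding row per distinct path identifier.
    return n_paths * dim

def rnn_params(n_node_types, dim):
    # Node embeddings plus the two weight matrices and the bias.
    return n_node_types * dim + 2 * dim * dim + dim

for p in paths:
    print(p, rnn_path_vector(p, node_emb, w_x, w_h, b))
```

In this sketch the hash-based table grows with the number of distinct paths, while the RNN parameters grow only with the number of node types, which is typically far smaller. The node sequence also stays visible to the model, so paths that share nodes share parameters.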

Full Text