A Mocktail of Source Code Representations

Dheeraj Vagavolu, Sridhar Chimalakonda, Karthik Chandra Swarna

doi:10.1109/ase51524.2021.9678551

Abstract

Efficient representation of source code is essential for various software engineering tasks such as code classification and code clone detection. Most recent approaches for representing source code still rely only on the abstract syntax tree (AST) and do not leverage semantic graphs such as the control flow graph (CFG) and the program dependence graph (PDG). One effective technique for representing source code involves extracting paths from the AST and using a learning model to capture program properties. Code2vec is one such path-based approach: it uses an attention-based neural network to learn code embeddings, which can then be used for various downstream tasks. However, this approach uses only the AST and does not leverage the CFG and PDG. Even though an integrated graph approach (the Code Property Graph) exists for representing source code, it has only been explored in the domain of software security. Moreover, it does not leverage the paths from the individual graphs. Our idea is to extend the path-based approach code2vec to include the semantic graphs CFG and PDG alongside the AST, a combination that is largely unexplored in software engineering. We evaluate our approach on the task of method naming using a C dataset of 730K methods collected from GitHub. In comparison to code2vec, our approach improves the F1 score by 11% on the full dataset and by up to 100% on individual projects. We show that semantic features from the CFG and PDG paths drastically improve performance on software engineering tasks. We envision that looking at a mocktail of source code representations for various software engineering tasks can lay the foundation for a new line of research and an overhaul of existing research.

Full Text