Automatic Code Annotation Generation Based on Heterogeneous Graph Structure

Zhijie Jiang, Yingwei Ma, Yao Zhang, Haixu Xiong, Yun Xiong, Shanshan Li, Yan Ding

doi:10.1109/saner56733.2023.00053

Abstract

Automatic code annotation generation aims to produce readable annotations that describe the functionality of source code, which can assist software developers and programmers. Previous methods follow an encoder-decoder structure in which the encoder uses the abstract syntax tree (AST) to encode the syntactic structure of a code fragment. However, the AST alone cannot fully express the complicated control structures, data flows, or dependencies of source code, leading to sub-optimal annotations. In addition, the same functionality can be implemented in various ways, with possibly different structures and token names, yet most methods treat code fragments independently and do not exploit the similarities among them. In this paper, we present HANCode2Seq, an automatic code annotation generation method that uses a heterogeneous code representation graph. Specifically, we construct the heterogeneous graph by combining multiple code-induced graphs: abstract syntax trees, control flow graphs, data flow graphs, and program dependency graphs. A heterogeneous graph attention network is then applied to extract the semantics and syntactic structure of the source code fragments. Furthermore, we present a novel adaptive code similarity graph whose nodes are code fragments. The representation of a code fragment is enhanced by aggregating information from similar fragments on the graph, which may reduce the ambiguity of the code. Experimental results on real datasets show that the proposed model outperforms the baselines and produces more fluent and readable code annotations.
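The two structural ideas in the abstract, merging several code-induced graphs into one graph with typed edges and enriching a fragment's representation with information from similar fragments, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_heterogeneous_graph` and `enhance_with_similar`, the edge-type labels, the toy vectors, and the use of cosine similarity with a fixed top-k mean. The paper itself uses a heterogeneous graph attention network and an adaptive similarity graph, neither of which this sketch reproduces.

```python
import math


def build_heterogeneous_graph(typed_edge_lists):
    """Merge edge lists that share node ids into one graph with typed edges.

    typed_edge_lists maps an edge type (e.g. "ast", "cfg") to (src, dst) pairs.
    Returns a dict: node -> set of (edge_type, neighbor).
    """
    graph = {}
    for edge_type, edges in typed_edge_lists.items():
        for src, dst in edges:
            graph.setdefault(src, set()).add((edge_type, dst))
            graph.setdefault(dst, set())  # keep sink nodes in the graph
    return graph


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def enhance_with_similar(embeddings, k=1, alpha=0.5):
    """Blend each vector with the mean of its k most similar other vectors.

    Assumption: a fixed top-k neighborhood stands in for the adaptive
    similarity graph described in the abstract.
    """
    enhanced = []
    for i, vec in enumerate(embeddings):
        ranked = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(embeddings) if j != i),
            reverse=True,
        )[:k]
        if not ranked:
            enhanced.append(list(vec))
            continue
        mean = [
            sum(embeddings[j][d] for _, j in ranked) / len(ranked)
            for d in range(len(vec))
        ]
        enhanced.append([alpha * x + (1 - alpha) * m for x, m in zip(vec, mean)])
    return enhanced


# Toy example: three statement nodes connected by three kinds of edges.
graph = build_heterogeneous_graph(
    {"ast": [(0, 1), (0, 2)], "cfg": [(1, 2)], "dfg": [(1, 2)]}
)

# Toy fragment vectors: the first two are identical, the third is orthogonal.
enhanced = enhance_with_similar([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], k=1)
```

Keeping the edge type alongside each neighbor is what makes the merged graph heterogeneous: a downstream model can weight syntax, control-flow, and data-flow edges differently instead of treating them as one relation.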

Full Text