The methodology that employs deep learning to handle software engineering tasks, such as bug detection, is commonly referred to as source code learning. Given the inherently graph-structured nature of source code, graph learning, powered by graph neural networks (GNNs), has seen increasing adoption in this domain. As in other deep learning contexts, source code learning relies on large volumes of high-quality training data, and the scarcity of such data has become a primary impediment that leads to performance bottlenecks. In practice, data augmentation is often used to mitigate this issue by synthesizing additional training data from existing samples. However, most existing data augmentation practice in source code learning is limited to simple program transformations, such as code refactoring, and is therefore insufficiently effective. In this work, in light of the graph nature of source code, we propose to apply data augmentation methods designed for graph-structured data in graph learning to source code learning tasks, and we conduct a comprehensive empirical study to evaluate whether these augmentation approaches yield better effectiveness, in terms of producing more accurate and robust models. Specifically, we evaluate five data augmentation methods across four critical software engineering tasks and seven neural network architectures. Experimental results show that, compared to augmentation-free training, the Manifold-Mixup method can significantly improve both the accuracy and robustness of trained source code learning models, by up to 1.60% and 4.09%, respectively.
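The core idea behind Manifold-Mixup, the best-performing augmentation method above, is to interpolate intermediate-layer representations and their labels rather than raw inputs. Below is a minimal, framework-agnostic sketch of that interpolation step in NumPy; the function name, the `alpha` default, and the example dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def manifold_mixup(h_i, h_j, y_i, y_j, alpha=2.0):
    """Mix two hidden representations and their one-hot labels.

    A sketch of the Manifold-Mixup interpolation step: a mixing
    coefficient lam is drawn from Beta(alpha, alpha), and both the
    hidden activations and the labels are linearly interpolated.
    The alpha value here is a hypothetical choice for illustration.
    """
    lam = rng.beta(alpha, alpha)
    h_mix = lam * h_i + (1 - lam) * h_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return h_mix, y_mix, lam

# Example: mix two 4-dimensional hidden vectors with one-hot labels.
h_i, h_j = np.ones(4), np.zeros(4)
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
h_mix, y_mix, lam = manifold_mixup(h_i, h_j, y_i, y_j)
```

In a GNN for source code, `h_i` and `h_j` would typically be pooled graph-level embeddings of two training programs; the mixed pair `(h_mix, y_mix)` is then fed to the remaining layers as an additional training example.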