Abstract

Data augmentation is recognized as one of the main techniques for improving the generalization ability of deep learning models. However, it has not been widely applied to big code tasks because of the inherent difficulty of manipulating source code to generate new labeled data of high quality. In this paper, we propose a general data augmentation method based on program transformation. The idea is to extend big code datasets with a set of source-to-source transformation rules that preserve not only the semantics but also the syntactic naturalness of programs. Through controlled experiments, we demonstrate that semantics preservation and syntactic-naturalness preservation are the properties a transformation rule needs in order to be effective for data augmentation. We designed 18 transformation rules that are proven to be semantics-preserving and empirically tested to be syntactic-naturalness-preserving. We also implemented and open-sourced a partial program transformation tool for Java based on these rules, named SPAT, whose effectiveness for data augmentation is validated on three big code tasks: method naming, code commenting, and code clone detection.

Editor's note: Open Science material was validated by the Journal of Systems and Software Open Science Board.
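For illustration, the sketch below shows the kind of rewrite such a rule might perform: a for-to-while loop conversion applied to a small Java method. It is a hypothetical example only, not necessarily one of the paper's 18 rules, and the class and method names are invented for the demonstration; the point is that the transformed code behaves identically and still reads like natural, human-written Java, so the original label (here, the method name) remains valid training data.

    // Hypothetical example of a semantics-preserving, naturalness-preserving
    // source-to-source rewrite (for illustration only; not necessarily one of
    // the paper's 18 rules).
    public class SpatLikeExample {

        // Original method as it might appear in the training set.
        static int sum(int[] values) {
            int total = 0;
            for (int i = 0; i < values.length; i++) {
                total += values[i];
            }
            return total;
        }

        // Augmented variant produced by a for-to-while rewrite; behavior is
        // identical, and the syntax remains natural-looking. (Renamed here only
        // so both variants compile in one class.)
        static int sumAugmented(int[] values) {
            int total = 0;
            int i = 0;
            while (i < values.length) {
                total += values[i];
                i++;
            }
            return total;
        }

        public static void main(String[] args) {
            int[] data = {1, 2, 3, 4};
            // Both variants compute the same result.
            System.out.println(sum(data) + " == " + sumAugmented(data));
        }
    }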
