A deep neural network language model with contexts for source code

Anh Tuan Nguyen,Hung Dang Phan,Trong Duc Nguyen,Tien N Nguyen

doi:10.1109/saner.2018.8330220

Abstract

Statistical language models (LMs) have been applied in several software engineering applications. However, they have issues in dealing with ambiguities in the names of program and API elements (classes and method calls). In this paper, inspired by the success of Deep Neural Network (DNN) in natural language processing, we present Dnn4C, a DNN language model that complements the local context of lexical code elements with both syntactic and type contexts. We designed a context-incorporating method to use with syntactic and type annotations for source code in order to learn to distinguish the lexical tokens in different syntactic and type contexts. Our empirical evaluation on code completion for real-world projects shows that Dnn4C relatively improves 11.6%, 16.3%, 27.1%, and 44.7% top-1 accuracy over the state-of-the-art language models for source code used with the same features: RNN LM, DNN LM, SLAMC, and n-gram LM, respectively. For another application, we showed that Dnn4C helps improve accuracy over n-gram LM in migrating source code from Java to C# with a machine translation model.

Full Text