Abstract

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data for new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one particularly promising application is code generation from natural language descriptions. A bottleneck in developing such MT-inspired systems is acquiring the parallel text-code corpora required to train code-generative models. This paper addresses the problem of automatically synthesizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. We therefore propose synthesizing parallel corpora in which parsed test function names serve as code descriptions, aligned with the corresponding function bodies. We present the results of applying a state-of-the-art MT method to such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.
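The corpus-construction step described above, turning a descriptive test method name into a quasi-natural language description, can be sketched as follows. This is a minimal illustration assuming camelCase Java test-naming conventions; `name_to_description` is a hypothetical helper for exposition, not the paper's actual implementation.

```python
import re

def name_to_description(method_name: str) -> str:
    """Convert a camelCase test method name into a quasi-natural
    language description (illustrative parsing rule only)."""
    # Drop a conventional "test" / "test_" prefix if present
    name = re.sub(r"^test_?", "", method_name)
    # Split at case boundaries: "ReturnsEmptyList" -> ["Returns", "Empty", "List"]
    words = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)
    return " ".join(w.lower() for w in words)

# The description side of one (text, code) pair in the parallel corpus:
print(name_to_description("testReturnsEmptyListForNullInput"))
# -> returns empty list for null input
```

Each such description would then be aligned with the body of the corresponding test function to form one text-code training pair.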

Highlights

  • As digitization spreads into all areas of business and social life, the pressure on software development organizations is growing

  • We apply a neural machine translation (MT) architecture designed for learning from bi-text corpora, in which the degree of equivalence between source and target languages is very high. This off-the-shelf architecture performs well on our text-code corpora, suggesting that the quasi-natural language descriptions obtained with our approach are precise and consistent enough to allow direct translation to code

  • We have presented a method that exploits the availability of source code in open software repositories to automatically construct an aligned text-code dataset


Introduction

As digitization spreads into all areas of business and social life, the pressure on software development organizations is growing. The sheer amount of code being created, and the increasing complexity of software systems, fuels the need for new methods and tools to support the software development process. A widely adopted framework addressing the challenges of the modern software delivery lifecycle is the DevOps model [1], which is founded on the principles of continuous integration, continuous delivery, and continuous testing. Both the wisdom of the crowd and academic evidence [2] speak for the efficiency of DevOps practice, but adopting DevOps brings its own challenges, including a significant increase in the volume and frequency of testing. To generate unit test cases automatically, existing approaches use information extracted from other software artifacts, such as the code under test, specification models, or execution logs [3].

