Abstract

Automated program repair (APR) techniques have shown a promising ability to generate patches that fix program bugs automatically. Typically, such APR tools are dynamic in the sense that they find bugs by testing and validate patches by running a program's test suite. Patches can also be validated manually. However, neither of these methods can truly tell whether a patch is correct. Test suites are usually incomplete, so APR-generated patches may pass the tests without being truly correct; in other words, the APR tools may be overfitting to the tests. The possibility of test overfitting leads to manual validation, which is costly, potentially biased, and can also be incomplete. Therefore, we must move past these methods to truly assess APR's overfitting problem. We aim to evaluate the test overfitting problem in dynamic APR tools using ground truth given by a set of programs equipped with formal behavioral specifications. Using these formal specifications and an automated verification tool, we found that there is definitely overfitting in the generated patches of seven well-studied APR tools, although many (about 59%) of the generated patches were indeed correct. Our study further points out two new problems that can affect APR tools: changes to the complexity of programs and numeric problems. An additional contribution is that we introduce the first publicly available data set of formally specified and verified Java programs, their test suites, and buggy variants, each of which has exactly one bug.
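To make the notion of test overfitting concrete, here is a minimal illustrative sketch (not taken from the paper; the class, method names, and JML clause are hypothetical). A patch hard-codes the one input the test suite exercises, so it passes the tests while still violating the formal specification that a verifier would check:

```java
// Hypothetical illustration of test overfitting.
// Intended behavior, stated as a JML-style specification:
//   //@ ensures \result >= 0 && (\result == n || \result == -n);
public class Abs {
    // Buggy original: fails to negate negative inputs.
    static int absBuggy(int n) {
        return n; // bug: should return -n when n < 0
    }

    // Overfitting "repair": special-cases the one failing test input.
    // It passes the test suite below but is wrong for every other
    // negative n, so verification against the spec would reject it.
    static int absOverfit(int n) {
        if (n == -5) return 5;
        return n;
    }

    // The (incomplete) test suite only exercises n == -5 and n == 3.
    public static void main(String[] args) {
        assert absOverfit(-5) == 5;
        assert absOverfit(3) == 3;
        // All tests pass, yet the spec is violated, e.g. for n == -2:
        System.out.println("absOverfit(-2) == " + absOverfit(-2));
    }
}
```

Run with assertions enabled (`java -ea Abs`): both tests pass, which is exactly the situation where a formal specification, rather than the test suite, is needed as ground truth for patch correctness.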
