Abstract

In this article, we empirically study the suitability of tests as acceptance criteria for automated program fixes, by checking patches produced by automated repair tools with a bug-finding tool, as opposed to previous work that relied on tests or manual inspection. We develop a number of experiments in which faulty programs from IntroClass, a well-known benchmark for program repair techniques, are fed to the program repair tools GenProg, Angelix, AutoFix and Nopol, using test suites of varying quality, including those accompanying the benchmark. We then check the produced patches against formal specifications using a bug-finding tool. Our results show that, in the studied scenarios, automated program repair tools are significantly more likely to accept a spurious program fix than to produce an actual one. Using bounded-exhaustive suites larger than those originally provided (with about 100 and 1,000 tests), we observe that overfitting is reduced, but (a) few new correct repairs are generated, and (b) some tools see their performance degrade with the larger suites, producing fewer correct repairs. Finally, by comparing with previous work, we show that overfitting is underestimated in semantics-based tools, and that patches not discarded using held-out tests may be discarded using a bug-finding tool.
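To make the notion of an overfitting (spurious) patch concrete, consider the following hypothetical sketch, not taken from the paper: an IntroClass-style "median of three" routine whose "repair" satisfies the only tests available to the repair tool, yet violates the intended specification on other inputs. This is precisely the kind of patch that test-based acceptance lets through and a bug-finding tool working against a formal specification can reject.

```c
/*
 * Hypothetical illustration (not from the benchmark): an overfitting patch
 * for an IntroClass-style "median of three" routine. It passes the two
 * sample tests below but is wrong in general.
 */
#include <assert.h>
#include <stdio.h>

/* Intended specification: return the middle value of a, b, c. */
int median_patched(int a, int b, int c) {
    /* Overfitting "fix": correct whenever b happens to be the median,
       as in the tests below, but wrong e.g. for (3, 1, 2). */
    return b;
}

int main(void) {
    /* The only acceptance tests available to the repair tool. */
    assert(median_patched(1, 2, 3) == 2);  /* passes */
    assert(median_patched(9, 5, 1) == 5);  /* passes */
    /* A specification-based or bounded-exhaustive check would reject it:
       median_patched(3, 1, 2) returns 1, while the median is 2. */
    printf("All provided tests pass, yet the patch is spurious.\n");
    return 0;
}
```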
