Enhancing the Efficiency of Automated Program Repair via Greybox Analysis

Abstract

In this paper, we focus on the efficiency of automated program repair (APR). Recently, an efficient patch scheduling algorithm, Casino, has been proposed to improve APR efficiency. Inspired by fuzzing, Casino adaptively chooses the next patch candidate to evaluate based on the results of previous evaluations. However, we observe that Casino utilizes only the test results, treating the patched program as a black box. Inspired by greybox fuzzing, we propose a novel patch-scheduling algorithm, Gresino, which leverages the internal state of the program to further enhance APR efficiency. Specifically, Gresino monitors the hit counts of branches observed during the execution of the program and uses them to guide the search for a valid patch. Our experimental evaluation on the Defects4J benchmark and eight APR tools demonstrates the efficacy of our approach.
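
To make the idea of greybox patch scheduling concrete, the following is a minimal illustrative sketch, not the Gresino algorithm itself, whose details the abstract does not give. It assumes that patch candidates are grouped by location, that the next location is chosen by a Beta-distribution (Thompson-sampling-style) draw, and that a patch counts as "interesting" when it passes the tests or produces a branch hit-count signature not seen before. The names `GreyboxScheduler`, `branch_signature`, and the reward rule are all hypothetical; the coarse hit-count buckets follow the convention used by greybox fuzzers such as AFL.

```python
import random
from collections import defaultdict


def branch_signature(hit_counts):
    """Bucket raw branch hit counts so that small count changes are ignored."""
    def bucket(n):
        # AFL-style buckets: 1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+
        for limit in (0, 1, 2, 3, 7, 15, 31, 127):
            if n <= limit:
                return limit
        return 128
    return frozenset((br, bucket(n)) for br, n in hit_counts.items() if n > 0)


class GreyboxScheduler:
    """Hypothetical scheduler: picks the next patch using test and branch feedback."""

    def __init__(self, candidates, baseline_hits, seed=0):
        # candidates: dict mapping a location id to a list of patch ids (assumed layout)
        self.queue = {loc: list(patches) for loc, patches in candidates.items()}
        # One Beta(alpha, beta) pair per location, starting from a uniform prior
        self.stats = defaultdict(lambda: [1, 1])
        # Signature of the unpatched program is treated as already seen
        self.seen = {branch_signature(baseline_hits)}
        self.rng = random.Random(seed)

    def next_patch(self):
        """Return (location, patch) for the next candidate, or None when exhausted."""
        live = [loc for loc, q in self.queue.items() if q]
        if not live:
            return None
        loc = max(live, key=lambda l: self.rng.betavariate(*self.stats[l]))
        return loc, self.queue[loc].pop(0)

    def report(self, loc, tests_passed, hit_counts):
        """Feed back one evaluation; reward locations that pass tests or show new behavior."""
        sig = branch_signature(hit_counts)
        interesting = tests_passed or sig not in self.seen
        self.seen.add(sig)
        self.stats[loc][0 if interesting else 1] += 1
        return interesting
```

In this sketch, a black-box scheduler would correspond to dropping the `sig not in self.seen` term, so that only test outcomes update the per-location statistics.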

Similar Papers
  • Research Article
  • Cite count: 6
  • 10.1145/3654441
MTL-TRANSFER: Leveraging Multi-task Learning and Transferred Knowledge for Improving Fault Localization and Program Repair
  • Jun 27, 2024
  • ACM Transactions on Software Engineering and Methodology
  • Xu Wang + 7 more

Fault localization (FL) and automated program repair (APR) are two main tasks of automatic software debugging. Compared with traditional methods, deep learning-based approaches have been demonstrated to achieve better performance in FL and APR tasks. However, the existing deep learning-based FL methods ignore the deep semantic features or only consider simple code representations. And for APR tasks, existing template-based APR methods are weak in selecting the correct fix templates for more effective program repair, which are also not able to synthesize patches via the embedded end-to-end code modification knowledge obtained by training models on large-scale bug-fix code pairs. Moreover, in most FL and APR methods, the model designs and training phases are performed separately, leading to ineffective sharing of updated parameters and extracted knowledge during the training process. This limitation hinders the further improvement in the performance of FL and APR tasks. To solve the above problems, we propose a novel approach called MTL-TRANSFER, which leverages a multi-task learning strategy to extract deep semantic features and transferred knowledge from different perspectives. First, we construct a large-scale open-source bug dataset and implement 11 multi-task learning models for bug detection and patch generation sub-tasks on 11 commonly used bug types, as well as one multi-classifier to learn the relevant semantics for the subsequent fix template selection task. Second, an MLP-based ranking model is leveraged to fuse spectrum-based, mutation-based and semantic-based features to generate a sorted list of suspicious statements. Third, we combine the patches generated by the neural patch generation sub-task from the multi-task learning strategy with the optimized fix template selecting order gained from the multi-classifier mentioned above.
Finally, the more accurate FL results, the optimized fix template selecting order, and the expanded patch candidates are combined together to further enhance the overall performance of APR tasks. Our extensive experiments on widely-used benchmark Defects4J show that MTL-TRANSFER outperforms all baselines in FL and APR tasks, proving the effectiveness of our approach. Compared with our previously proposed FL method TRANSFER-FL (which is also the state-of-the-art statement-level FL method), MTL-TRANSFER increases the faults hit by 8/11/12 on Top-1/3/5 metrics (92/159/183 in total). And on APR tasks, the number of successfully repaired bugs of MTL-TRANSFER under the perfect localization setting reaches 75, which is 8 more than our previous APR method TRANSFER-PR. Furthermore, another experiment to simulate the actual repair scenarios shows that MTL-TRANSFER can successfully repair 15 and 9 more bugs (56 in total) compared with TBar and TRANSFER, which demonstrates the effectiveness of the combination of our optimized FL and APR components.

  • Research Article
  • Cite count: 12
  • 10.1109/tse.2020.2987862
Restore: Retrospective Fault Localization Enhancing Automated Program Repair
  • Apr 2, 2020
  • IEEE Transactions on Software Engineering
  • Tongtong Xu + 5 more

Fault localization is a crucial step of automated program repair, because accurately identifying program locations that are most closely implicated with a fault greatly affects the effectiveness of the patching process. An ideal fault localization technique would provide precise information while requiring moderate computational resources, to best support an efficient search for correct fixes. In contrast, most automated program repair tools use standard fault localization techniques, which are not tightly integrated with the overall program repair process, and hence deliver only subpar efficiency. In this paper, we present retrospective fault localization: a novel fault localization technique geared to the requirements of automated program repair. A key idea of retrospective fault localization is to reuse the outcome of failed patch validation to support mutation-based dynamic analysis, providing accurate fault localization information without incurring onerous computational costs. We implemented retrospective fault localization in a tool called Restore, based on the Jaid Java program repair system. Experiments involving faults from the Defects4J standard benchmark indicate that retrospective fault localization can boost automated program repair: Restore efficiently explores a large fix space, delivering state-of-the-art effectiveness (41 Defects4J bugs correctly fixed, 8 of which no other automated repair tool for Java can fix) while simultaneously boosting performance (speedup over 3 compared to Jaid). Retrospective fault localization is applicable to any automated program repair techniques that rely on fault localization and dynamic validation of patches.

  • Conference Article
  • Cite count: 31
  • 10.1109/icsme.2016.68
Empirical Study on Synthesis Engines for Semantics-Based Program Repair
  • Oct 1, 2016
  • Xuan-Bach D Le + 2 more

Automatic Program Repair (APR) is an emerging and rapidly growing research area, with many techniques proposed to repair defective software. One notable state-of-the-art line of APR approaches is known as semantics-based techniques, e.g., Angelix, which extract semantics constraints, i.e., specifications, via symbolic execution and test suites, and then generate repairs conforming to these constraints using program synthesis. The repair capability of such approaches (expressive power, output quality, and scalability) naturally depends on the underlying synthesis technique. However, despite recent advances in program synthesis, not much attention has been paid to assess, compare, or leverage the variety of available synthesis engine capabilities in an APR context. In this paper, we empirically compare the effectiveness of different synthesis engines for program repair. We do this by implementing a framework on top of the latest semantics-based APR technique, Angelix, that allows us to use different such engines. For this preliminary study, we use a subset of bugs in the IntroClass benchmark, a dataset of many small programs recently proposed for use in evaluating APR techniques, with a focus on assessing output quality. Our initial findings suggest that different synthesis engines have their own strengths and weaknesses, and future work on semantics-based APR should explore innovative ways to exploit and combine multiple synthesis engines.

  • Conference Article
  • Cite count: 6
  • 10.1109/fie44824.2020.9274053
Current State and Next Steps on Automated Hints for Students Learning to Code
  • Jan 1, 2020
  • Daniel Toll + 2 more

The core of this work-in-progress is that the best way to learn how to code is to practice by solving problems. However, if students have trouble with this, they can get frustrated and give up. Automated Tutoring Systems (ATS) aim to provide hints to help them solve the problems they encounter. Many of the existing systems offer general hints, e.g., “check the conditional statement” or help the student interpret the compiler or test-case errors. While this can be useful, we think that an ATS should provide interactive and specialized feedback for each program. We snowballed through publications on promising ATS and found that there are several such systems (in 27 publications), but we could also identify many challenges and that our requirements were not met by any existing system. For example, few of them work on general-purpose programming languages, e.g., Java, or scale to realistic problems consisting of multiple methods and classes. From the search, we find ATS based on Automated Program Repair (APR) shows the most promise. However, while program repair has the potential to generate specialized hints to help guide the student to a working state, studies that looked into these have identified further challenges. For example, many APR ATS tools only show the repaired program to the students, who then have to compare and modify their program accordingly. Another issue is that APR generally only modifies a few lines, so if the student solution is far from correct, the repair might fail. This can be solved by partial repair, i.e., the program is repaired so at least one additional test-case passes. While this increases the repair rate, it might make hints more difficult or point the students in a non-obvious or even “wrong” direction. The APR can take several minutes, which also makes it unsuitable for interactive ATS. We take a design science approach to define an ATS based on APR that attempts to address the identified challenges. 
We give a review of the state-of-the-art for the required components, e.g., APR, how to generate hints from differences between two programs. From this, we suggest a three-step roadmap: 1. identify suitable APR tools, 2. construct an oversized test suite, and 3. adapt APR to the tutoring context.

  • Conference Article
  • Cite count: 2
  • 10.1109/apsec57359.2022.00020
Systematic Analysis of Defect-Specific Code Abstraction for Neural Program Repair
  • Dec 1, 2022
  • Kicheol Kim + 2 more

Automated program repair (APR) is in the spotlight in academia and in practice as a way to reduce the time and cost of maintenance for developers. Recently, APR research has increasingly relied on deep-learning models to understand and learn how to fix software bugs. Text-to-Text Transfer Transformer (T5), which achieved state-of-the-art scores on natural language processing benchmarks, has also shown promising results on program repair in recent studies. Deep-learning-based program repair studies commonly propose code abstraction techniques to avoid vocabulary problems and to learn fine-grained code transformations for generating bug-fixing patches. However, there has been little systematic analysis of code abstraction per bug type in deep-learning-based program repair. Therefore, we leverage TFix, a T5-based program repair tool, to evaluate how code abstraction techniques affect neural program repair. Our experimental results show that defect-specific code abstraction achieves a higher average BLEU score than the existing code abstraction technique in both T5 and multilingual-T5 (mT5) model-based TFix results. Also, mT5-based TFix with defect-specific code abstraction achieves a higher BLEU score than TFix on 37 of the 52 ESLint error types.

  • Research Article
  • Cite count: 10
  • 10.1145/3579637
Reliable Fix Patterns Inferred from Static Checkers for Automated Program Repair
  • May 26, 2023
  • ACM Transactions on Software Engineering and Methodology
  • Kui Liu + 8 more

Fix pattern-based patch generation is a promising direction in automated program repair (APR). Notably, it has been demonstrated to produce more acceptable and correct patches than the patches obtained with mutation operators through genetic programming. The performance of pattern-based APR systems, however, depends on the fix ingredients mined from fix changes in development histories. Unfortunately, collecting a reliable set of bug fixes in repositories can be challenging. In this article, we propose investigating the possibility in an APR scenario of leveraging fix patterns inferred from code changes that address violations detected by static analysis tools. To that end, we build a fix pattern-based APR tool, Avatar , which exploits fix patterns of static analysis violations as ingredients for the patch generation of repairing semantic bugs. Evaluated on four benchmarks (i.e., Defects4J, Bugs.jar, BEARS, and QuixBugs), Avatar presents the potential feasibility of fixing semantic bugs with the fix patterns inferred from the patches for fixing static analysis violations and can correctly fix 26 semantic bugs when Avatar is implemented with the normal program repair pipeline. We also find that Avatar achieves performance metrics that are comparable to that of the closely related approaches in the literature. Compared with CoCoNut, Avatar can fix 18 new bugs in Defects4J and 3 new bugs in QuixBugs. When compared with HDRepair, JAID, and SketchFix, Avatar can newly fix 14 Defects4J bugs. In terms of the number of correctly fixed bugs, Avatar is also comparable to the program repair tools with the normal fault localization setting and presents better performance than most program repair tools. These results imply that Avatar is complementary to current program repair approaches. 
We further uncover that Avatar can present different bug-fixing performances when it is configured with different fault localization tools, and the stack trace information from the failed executions of test cases can be exploited to improve the bug-fixing performance of Avatar by fixing more bugs with fewer generated patch candidates. Overall, our study highlights the relevance of static bug-finding tools as indirect contributors of fix ingredients for addressing code defects identified with functional test cases (i.e., dynamic information).

  • Conference Article
  • Cite count: 3
  • 10.1109/icse-companion.2019.00066
Improving Automated Program Repair with Retrospective Fault Localization
  • May 1, 2019
  • Tongtong Xu

Although recognized as a critical step in automated program repair, fault localization has been only loosely coupled into the fixing process in existing program repair approaches, in the sense that fault localization has limited interactions with other activities in fixing. We propose in this paper to deeply integrate fault localization into the fixing process to achieve more effective and efficient program repair. Our approach introduces a feedback loop in fixing between the activities for locating the fault causes and those for generating and evaluating candidate fixes. The feedback loop enables partial evaluation results of candidate fixes to be used to locate faults more accurately, and eventually leads to fixing processes with improved effectiveness and efficiency. We have implemented the approach into a tool, named RESTORE, based on the JAID program repair system. Experiments involving faults from the DEFECTS4J standard benchmark indicate that the integrated fault localization can boost automated program repair: RESTORE produced valid fixes to 63 faults and correct ones to 38 faults, outperforming any other state-of-the-art repair tool for Java while taking 36% less running time compared with JAID.

  • Research Article
  • Cite count: 13
  • 10.1016/j.jss.2017.06.039
The impacts of techniques, programs and tests on automated program repair: An empirical study
  • Jun 17, 2017
  • Journal of Systems and Software
  • Xianglong Kong + 3 more


  • Research Article
  • Cite count: 2
  • 10.1145/3770581
Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair
  • Oct 3, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Qiong Feng + 5 more

LLMs have garnered considerable attention for their potential to streamline Automated Program Repair (APR). LLM-based approaches can either insert the correct code using an infilling-style technique or directly generate patches when provided with buggy methods, aiming for plausible patches to pass all tests. However, most LLM-based APR methods rely on a single type of software information, such as issue descriptions or error stack traces, without fully leveraging a combination of diverse software artifacts. Human developers, in contrast, often use a range of information (such as debugging data, issue discussions, and error stack traces) to diagnose and fix bugs. Despite this, many LLM-based approaches do not explore which specific types of software information best assist in localizing and repairing software bugs. Addressing this gap is crucial for advancing LLM-based APR techniques. To investigate this and mimic the way human developers fix bugs, we propose DEVLoRe (short for DEVeloper Localization and Repair). In this framework, LLMs first use issue content (description and discussion) and stack error traces to localize buggy methods, then rely on debug information in buggy methods and issue content and stack error to localize buggy lines and generate valid patches. We evaluated the effectiveness of issue content, error stack traces, and debugging information in bug localization and automatic program repair. Our results show that while issue content and error stack are particularly effective in assisting LLMs with fault localization and program repair respectively, different types of software artifacts complement each other in addressing various bugs. By incorporating these three types of artifacts and using the Defects4J v2.0 dataset for evaluation, DEVLoRe successfully localizes 49.3% of single-method bugs and generates 56.0% plausible patches. Additionally, DEVLoRe can localize 47.6% of non-single-method bugs and generates 14.5% plausible patches.
Moreover, our framework streamlines the end-to-end process from buggy source code to a complete repair, and achieves repair rates of 39.7% and 17.1% for single-method and non-single-method bugs respectively, outperforming current state-of-the-art APR methods. Furthermore, we re-implemented and evaluated our framework, demonstrating its effectiveness in resolving 9 unique issues compared to other state-of-the-art frameworks using the same or more advanced models on SWE-bench Lite. We also discussed whether a leading framework for Python code can be directly applied to Java code, or vice versa. The source code and experimental results of this work for replication are available at https://github.com/XYZboom/DEVLoRe .

  • Conference Article
  • Cite count: 7
  • 10.1109/icst49551.2021.00032
Automatic Program Repair as Semantic Suggestions: An Empirical Study
  • Apr 1, 2021
  • Diogo Campos + 3 more

Automated Program Repair (APR) is an area of research focused on the automatic generation of bug-fixing patches. Current APR approaches present some limitations, namely overfitted patches and low maintainability of the generated code. Several works are tackling this problem by attempting to come up with algorithms producing higher quality fixes. In this experience paper, we explore an alternative. We believe that by using existing low-cost APR techniques, fast enough to provide real-time feedback, and encouraging the developer to work together with the APR inside the IDE, will allow them to immediately discard proposed fixes deemed inappropriate or prone to reduce maintainability. Most developers are familiar with real-time syntactic code suggestions, usually provided as code completion mechanisms. What we propose are semantic code suggestions, such as code fixes, which are seldom automatic and rarely real-time. To test our hypothesis, we implemented a Visual Studio Code extension (named pAPRika), which leverages unit tests as specifications and generates code variations to repair bugs in JavaScript. We conducted a preliminary empirical study with 16 participants in a crossover design. Our results provide evidence that, although incorporating APR in the IDE improves the speed of repairing faulty programs, some developers are too eager to accept patches, disregarding maintenance concerns.

  • Conference Article
  • Cite count: 4
  • 10.1109/icsme52107.2021.00010
Revisiting Test Cases to Boost Generate-and-Validate Program Repair
  • Sep 1, 2021
  • Jingtang Zhang + 6 more

Fault localization produces bug positions as the basic input for many automated program repair (APR) systems. Given that test cases are the common means that automatic fault localization techniques leverage, we investigate the impact of their characteristics (in terms of quality and quantity) on APR. In particular, we analyze the statements that appear in crash stack traces when test cases fail (note that stack traces are available when an ordinary test case fails since its verdict is often made by assertions that produce errors such as AssertError in Java and JUnit), and explore the possibility of using some relevant crash information to enhance fault localization; this ultimately improves the effectiveness of APR tools. Our study reveals that the considered state-of-the-art APR systems achieve the best performance when fixing bugs associated with boolean type expected values (e.g., assertTrue() or assertFalse()). In contrast, they achieve their worst performance when addressing bugs related to null check assertions. Meanwhile, null check bugs as well as the bugs associated with boolean and string type expected values are still the main challenge that should be addressed by future APR. For exception throwing bugs, existing APR systems present the best performance on fixing NullPointerException bugs, while the toughest task for them is to resolve the bugs throwing developer-defined exceptions. The information in stack traces after executing the bug-triggering test cases can be used to effectively improve the performance on fault localization and program repair.

  • Research Article
  • Cite count: 2
  • 10.1007/s11859-018-1358-2
Predicting Effectiveness of Generate-and-Validate Patch Generation Systems Using Random Forest
  • Dec 1, 2018
  • Wuhan University Journal of Natural Sciences
  • Yong Xu + 3 more

One way to improve the practicability of automatic program repair (APR) techniques is to build prediction models which can predict whether an application of an APR technique on a bug is effective or not. Existing prediction models have some limitations. First, the prediction models are built with hand-crafted features which usually fail to capture the semantic characteristics of the program repair task. Second, the performance of the prediction models is only evaluated on Genprog, a genetic-programming based APR technique. This paper develops prediction models, i.e., random forest prediction models for SPR, another kind of generate-and-validate APR technique, which can distinguish ineffective repair instances from effective repair instances. Rather than hand-crafted features, we use features automatically learned by a deep belief network (DBN) to train the prediction models. The empirical results show that compared to the baseline models, that is, all effective models, our proposed models can at least improve the F1 by 9% and AUC (area under the receiver operating characteristics curve) by 19%. At the same time, the prediction model using learned features at least outperforms the one using hand-crafted features in terms of F1 by 11%.

  • Research Article
  • Cite count: 28
  • 10.1109/tse.2022.3194188
How do Developers Really Feel About Bug Fixing? Directions for Automatic Program Repair
  • Apr 1, 2023
  • IEEE Transactions on Software Engineering
  • Emily Winter + 6 more

Automatic program repair (APR) is a rapidly advancing field of software engineering that aims to supplement or replace manual bug fixing with an automated tool. For APR to be successfully adopted in industry, it is vital that APR tools respond to developer needs and preferences. However, very little research has considered developers' general attitudes to APR or developers' current bug fixing practices (the activity APR aims to replace). This paper responds to this gap by reporting on a survey of 386 software developers about their bug finding and fixing practices and experiences, and their instinctive attitudes towards APR. We find that bug finding and fixing is not necessarily as onerous for developers as has often been suggested, being rated as more satisfying than developers' general work. The fact that developers derive satisfaction and benefit from bug fixing indicates that APR adoption is not as simple as APR replacing an unwanted activity. When it comes to potential APR approaches, we find a strong preference for developers being kept in the loop (for example, choosing between different fixes or validating fixes) as opposed to a fully automated process. This suggests that advances in APR should be careful to consider the agency of the developer, as well as what information is presented to developers alongside fixes. It also indicates that there are key barriers related to trust that would need to be overcome for full scale APR adoption, supported by the fact that even those developers who stated that they were positive about APR listed several caveats and concerns. We find very few statistically significant relationships between particular demographic variables (for example, developer experience, age, education) and key attitudinal variables, suggesting that developers' instinctive attitudes towards APR are little influenced by experience level but are held widely across the developer community.

  • Research Article
  • Cite count: 23
  • 10.1109/tse.2022.3152089
Let’s Talk With Developers, Not About Developers: A Review of Automatic Program Repair Research
  • Jan 1, 2023
  • IEEE Transactions on Software Engineering
  • Emily Winter + 6 more

Automatic program repair (APR) offers significant potential for automating some coding tasks. Using APR could reduce the high costs historically associated with fixing code faults and deliver significant benefits to software engineering. Adopting APR could also have profound implications for software developers’ daily activities, transforming their work practices. To realise the benefits of APR it is vital that we consider how developers feel about APR and the impact APR may have on developers’ work. Developing APR tools without consideration of the developer is likely to undermine the success of APR deployment. In this paper, we critically review how developers are considered in APR research by analysing how human factors are treated in 260 studies from Monperrus’s Living Review of APR. Over half of the 260 studies in our review were motivated by a problem faced by developers (e.g., the difficulty associated with fixing faults). Despite these human-oriented motivations, fewer than 7% of the 260 studies included a human study. We looked in detail at these human studies and found their quality mixed (for example, one human study was based on input from only one developer). Our results suggest that software developers are often talked about in APR studies, but are rarely talked with. A more comprehensive and reliable understanding of developer human factors in relation to APR is needed. Without this understanding, it will be difficult to develop APR tools and techniques which integrate effectively into developers’ workflows. We recommend a future research agenda to advance the study of human factors in APR.

  • Research Article
  • Cite count: 14
  • 10.1016/j.csi.2024.103951
The Use of Large Language Models for Program Repair
  • Nov 24, 2024
  • Computer Standards & Interfaces
  • Fida Zubair + 2 more

Large Language Models (LLMs) have emerged as a promising approach for automated program repair, offering code comprehension and generation capabilities that can address software bugs. Several program repair models based on LLMs have been developed recently. However, findings and insights from these efforts are scattered across various studies, lacking a systematic overview of LLMs' utilization in program repair. Therefore, this Systematic Literature Review (SLR) was conducted to investigate the current landscape of LLM utilization in program repair. This study defined seven research questions and thoroughly selected 41 relevant studies from scientific databases to explore these questions. The results shed light on the diverse capabilities of LLMs for program repair. The findings revealed that Encoder-Decoder architectures emerged as the prevalent LLM design for program repair tasks and that mostly open-access datasets were used. Several evaluation metrics were applied, primarily consisting of accuracy, exact match, and BLEU scores. Additionally, the review investigated several LLM fine-tuning methods, including fine-tuning on specialized datasets, curriculum learning, iterative approaches, and knowledge-intensified techniques. These findings pave the way for further research on utilizing the full potential of LLMs to revolutionize automated program repair.
