Reference-Based Retrieval-Augmented Unit Test Generation
Automated unit test generation has been widely studied, with Large Language Models (LLMs) recently showing significant potential. LLMs like GPT-4, trained on vast text and code data, excel in various code-related tasks, including unit test generation. However, existing LLM-based approaches often focus solely on the context within the code itself, such as referenced variables, while neglecting broader task-specific contexts, such as the utility of referring to existing tests of relevant methods in unit test generation. Moreover, in the context of unit test generation, these tools prioritize high code coverage, often at the expense of practical usability, correctness, and maintainability. In response, we propose Reference-Based Retrieval Augmentation, a novel mechanism that extends LLM-based Retrieval-Augmented Generation (RAG) to retrieve relevant information by considering task-specific context. In the unit test generation task, for a given focal method, the reference relationship is defined as the reusability or referentiality of tests between the focal method and other methods. To generate high-quality unit tests for the focal method, the test reference relationships are then used to retrieve relevant methods and their existing unit tests. Specifically, we account for the unique structure of unit tests by dividing the test generation process into Given, When, and Then phases. When generating unit tests for a focal method, we retrieve pre-existing tests of other relevant methods, which can provide valuable insights for any of the Given, When, and Then phases. We implement this approach in a tool called RefTest, which sequentially performs preprocessing, test reference retrieval, and unit test generation, using an incremental strategy in which newly generated tests guide the creation of subsequent ones.
We evaluated RefTest on 12 open-source projects with 1515 methods, and the results demonstrate that RefTest consistently outperforms existing tools in terms of correctness, completeness, and maintainability of the generated tests.
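The retrieval step described in the abstract above can be pictured with a minimal sketch. It is not RefTest's actual algorithm: the token-overlap (Jaccard) ranking, the function names `tokens`, `jaccard`, and `retrieve_reference_tests`, and the shape of `corpus` are all assumptions made for illustration. The idea shown is only that methods resembling the focal method are ranked first and their existing tests are returned as references.

```python
from __future__ import annotations

import re


def tokens(source: str) -> set[str]:
    """Split identifiers into lowercase word tokens (camelCase and snake_case)."""
    words = re.findall(r"[A-Za-z][a-z0-9]*", source)
    return {w.lower() for w in words}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (an assumed relevance measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_reference_tests(focal: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the tests of the k methods most similar to the focal method.

    `corpus` is a hypothetical mapping from a method signature to the
    source of its existing test.
    """
    focal_tokens = tokens(focal)
    ranked = sorted(
        corpus,
        key=lambda method: jaccard(focal_tokens, tokens(method)),
        reverse=True,
    )
    return [corpus[method] for method in ranked[:k]]
```

In a real pipeline the retrieved tests would be placed in the prompt as examples for the Given, When, or Then part of the new test.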
- Research Article
4
- 10.1145/3763791
- Aug 26, 2025
- ACM Transactions on Software Engineering and Methodology
Unit testing plays a pivotal role in the software development lifecycle, as it ensures code quality. However, writing high-quality unit tests remains a time-consuming task for developers in practice. More recently, the application of large language models (LLMs) in automated unit test generation has demonstrated promising results. Existing approaches primarily focus on interpreted programming languages (e.g., Java), while mature solutions tailored to compiled programming languages like C++ are yet to be explored. The intricate language features of C++, such as pointers, templates, and virtual functions, pose particular challenges for LLMs in generating both executable and high-coverage unit tests. To tackle the aforementioned problems, this paper introduces CITYWALK, a novel LLM-based framework for C++ unit test generation. CITYWALK enhances LLMs by providing a comprehensive understanding of the dependency relationships within the project under test via program analysis. Furthermore, CITYWALK incorporates language-specific knowledge about C++ derived from project documentation and empirical observations, significantly improving the correctness of the LLM-generated unit tests. We implement CITYWALK by employing the widely popular LLM GPT-4o. The experimental results show that CITYWALK outperforms current state-of-the-art approaches on a collection of ten popular C++ projects. Our findings demonstrate the effectiveness of CITYWALK in generating high-quality C++ unit tests.
- Research Article
2
- 10.1145/3745765
- Mar 11, 2026
- ACM Transactions on Software Engineering and Methodology
Recently, Large Language Models (LLMs) have shown promising results in code generation, and several automated test generation approaches based on LLMs have been proposed. Although these approaches achieve promising performance, they suffer from two limitations. First, they lack an intrinsic understanding of the semantic intricacies and logical constructs inherent to the focal method. Second, they ignore the diversity of the generated tests and generate tests with limited code coverage. To alleviate these two limitations, in this work, we propose a novel approach named TestCTRL that optimizes LLMs for unit test generation using a Chain-of-Thought (CoT) prompt and a Reinforcement Learning (RL) strategy. Specifically, we first build a new CoT dataset, containing the focal methods, corresponding unit tests, and CoT prompts. The CoT prompt includes the intention and possible test input values. Then, the CoT dataset is used to fine-tune one LLM (i.e., CodeLlama 7B) that can be seen as the policy model in RL. Meanwhile, we fine-tune another LLM (i.e., CodeGPT) as the reward model by predicting the line coverage of the focal method and its test. Moreover, we employ the Proximal Policy Optimization (PPO) algorithm to optimize the policy model and generate unit tests. We use the Defects4J benchmark to evaluate our approach from three perspectives (i.e., naturalness, validity, and code coverage). To avoid data leakage threats, we filtered out data from the CoT dataset that have the same focal method and test case names as those in Defects4J. The experimental results demonstrate that TestCTRL outperforms state-of-the-art baselines in both line and branch coverage. Besides, TestCTRL improves bug detection performance. We also investigate the reason for the proposed approach’s superiority.
- Research Article
80
- 10.1145/3660783
- Jul 12, 2024
- Proceedings of the ACM on Software Engineering
Unit testing plays an essential role in detecting bugs in functionally-discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although the traditional techniques are able to generate tests with reasonable coverage, they are shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the large potential of large language models (LLMs) in unit test generation. By being pre-trained on a massive developer-written code corpus, the models are capable of generating more human-like and meaningful test code. In this work, we perform the first empirical study to evaluate the capability of ChatGPT (i.e., one of the most representative LLMs with outstanding performance in code generation and comprehension) in unit test generation. In particular, we conduct both a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. We find that the tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures (mostly caused by incorrect assertions); but the passing tests generated by ChatGPT almost resemble manually-written tests by achieving comparable coverage, readability, and even sometimes developers’ preference. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved. Inspired by our findings above, we further propose ChatTester, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTester incorporates an initial test generator and an iterative test refiner.
Our evaluation demonstrates the effectiveness of ChatTester by generating 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT. In addition to ChatGPT, we further investigate the generalization capabilities of ChatTester by applying it to two recent open-source LLMs (i.e., CodeLlama-Instruct and CodeFuse) and our results show that ChatTester can also improve the quality of tests generated by these LLMs.
- Research Article
4
- 10.1145/3728970
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Unit testing plays a crucial role in bug detection and ensuring software correctness. It helps developers identify errors early in development, thereby reducing software defects. In recent years, large language models (LLMs) have demonstrated significant potential in automating unit test generation. However, using LLMs to generate unit tests faces many challenges. 1) The execution pass rate of the test cases generated by LLMs is low. 2) The test case coverage is inadequate, making it challenging to detect potential risks in the code. 3) Current research methods primarily focus on languages such as Java and Python, while studies on C programming are scarce, despite its importance in the real world. To address these challenges, we propose STRUT, a novel unit test generation method. STRUT utilizes structured test cases as a bridge between complex programming languages and LLMs. Instead of directly generating test code, STRUT guides LLMs to produce structured test cases, thereby alleviating the limitations of LLMs when generating code for programming languages with complex features. First, STRUT analyzes the context of focal methods and constructs structured seed test cases for them. These seed test cases then guide LLMs to generate a set of structured test cases. Subsequently, a rule-based approach is employed to convert the structured set of test cases into executable test code. We conducted a comprehensive evaluation of STRUT, which achieved an impressive execution pass rate of 96.01%, along with 77.67% line coverage and 63.60% branch coverage. This performance significantly surpasses that of the LLMs-based baseline methods and the symbolic execution tool SunwiseAUnit. These results highlight STRUT's superior capability in generating high-quality unit test cases by leveraging the strengths of LLMs while addressing their inherent limitations.
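To make the idea of a rule-based conversion from structured test cases to test code concrete, here is a minimal sketch. It is not STRUT's format or converter: the `StructuredCase` fields, the `render_c_test` function, and the single fixed template are assumptions chosen for illustration. The point is that the LLM would only need to fill in the structured record, while a deterministic template emits the C code.

```python
from dataclasses import dataclass


@dataclass
class StructuredCase:
    """Hypothetical structured test case: inputs and one expected return value."""
    name: str
    function: str
    args: list
    expected: int


def render_c_test(case: StructuredCase) -> str:
    """Apply a fixed template (the 'rule') to emit a C test function."""
    arg_list = ", ".join(str(a) for a in case.args)
    return (
        f"void {case.name}(void) {{\n"
        f"    assert({case.function}({arg_list}) == {case.expected});\n"
        f"}}\n"
    )
```

Because the template is fixed, syntax errors in the emitted C code can only come from the template itself, not from the model's output.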
- Conference Article
90
- 10.1145/3663529.3663801
- Jul 10, 2024
Unit testing is an essential yet frequently arduous task. Various automated unit test generation tools have been introduced to mitigate this challenge. Notably, methods based on large language models (LLMs) have garnered considerable attention and exhibited promising results in recent years. Nevertheless, LLM-based tools encounter limitations in generating accurate unit tests. This paper presents ChatUniTest, an LLM-based automated unit test generation framework. ChatUniTest incorporates an adaptive focal context mechanism to encompass valuable context in prompts and adheres to a generation-validation-repair mechanism to rectify errors in generated unit tests. Subsequently, we have developed ChatUniTest Core, a common library that implements core workflow, complemented by the ChatUniTest Toolchain, a suite of seamlessly integrated tools enhancing the capabilities of ChatUniTest. Our effectiveness evaluation reveals that ChatUniTest outperforms TestSpark and EvoSuite in half of the evaluated projects, achieving the highest overall line coverage. Furthermore, insights from our user study affirm that ChatUniTest delivers substantial value to various stakeholders in the software testing domain. ChatUniTest is available at https://github.com/ZJU-ACES-ISE/ChatUniTest, and the demo video is available at https://www.youtube.com/watch?v=GmfxQUqm2ZQ.
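A generation-validation-repair loop of the kind mentioned above can be sketched as follows. This is not ChatUniTest's implementation: `validate`, `generate_with_repair`, and the `generate` callback signature are hypothetical, and validation here simply executes Python test source, whereas a real tool would compile and run tests in the target language.

```python
from typing import Callable, Optional


def validate(test_source: str) -> Optional[str]:
    """Execute the test source; return None on success or an error message."""
    try:
        exec(compile(test_source, "<generated_test>", "exec"), {})
        return None
    except Exception as exc:  # covers SyntaxError and AssertionError alike
        return f"{type(exc).__name__}: {exc}"


def generate_with_repair(
    generate: Callable[[Optional[str], Optional[str]], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask `generate` for a test, feeding back the last failure until it passes.

    `generate(previous_test, error)` stands in for a call to an LLM; both
    arguments are None on the first round.
    """
    previous, error = None, None
    for _ in range(max_rounds):
        candidate = generate(previous, error)
        error = validate(candidate)
        if error is None:
            return candidate
        previous = candidate
    return None  # give up after max_rounds failed attempts
```

Bounding the number of rounds keeps the cost predictable when the model cannot repair a test.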
- Dissertation
- 10.31979/etd.kddt-d7ms
- Jan 1, 2025
Unit test generation is a critical step in the software development lifecycle to ensure code quality and reduce the likelihood of bugs. Manually writing unit tests can be time-consuming and require an experienced developer. However, with the emergence of generative AI, large language models (LLMs) in particular have demonstrated their effectiveness in generating code, which naturally raises the question of whether this capability can be applied to automate unit test generation. One of the newer techniques in this field is using Reinforcement Learning (RL) to train a model to generate quality unit tests. RL is the practice of training an agent to take optimal actions to maximize a reward signal. By treating the LLM as an agent and fine-tuning its parameters through feedback from the reward signal, it offers an adaptive and flexible method for improving LLM performance instead of relying on pre-trained models. This project explores different methodologies to augment a multi-model unit test generation framework including the use of RL to train its test generation capabilities. Using datasets derived from LeetCode and PyMethods2Test, our tool is evaluated against strong baseline LLMs like Gemini and Claude. The results show that the PPO-trained DeepSeek model consistently outperforms baseline generation, achieving higher test pass rates, fewer syntax errors, and improved coverage and mutation scores across both datasets, demonstrating that our framework presents an effective unit test generation method.
- Research Article
3
- 10.1145/3715778
- Jun 19, 2025
- Proceedings of the ACM on Software Engineering
Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the two datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves their performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%.
Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests.
- Conference Article
34
- 10.1145/3691620.3695529
- Oct 27, 2024
Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation. Existing research primarily focuses on closed-source LLMs (e.g., ChatGPT and Codex) with fixed prompting strategies, leaving the capabilities of advanced open-source LLMs with various prompting settings unexplored. Particularly, open-source LLMs offer advantages in data privacy protection and have demonstrated superior performance in some tasks. Moreover, effective prompting is crucial for maximizing LLMs' capabilities. In this paper, we conduct the first empirical study to fill this gap, based on 17 Java projects, five widely-used open-source LLMs with different structures and parameter sizes, and comprehensive evaluation metrics. Our findings highlight the significant influence of various prompt factors, show the performance of open-source LLMs compared to the commercial GPT-4 and the traditional EvoSuite, and identify limitations in LLM-based unit test generation. We then derive a series of implications from our study to guide future research and practical use of LLM-based unit test generation.
- Research Article
257
- 10.1109/tse.2023.3334955
- Jan 1, 2024
- IEEE Transactions on Software Engineering
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including automated generation of unit tests, but existing approaches require additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without requiring additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Furthermore, experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites.
We also find that 92.8% of TestPilot's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and StarCoder, an LLM for which the training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
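The similarity measure mentioned above can be illustrated with a short sketch. The abstract only says "normalized edit distance"; the specific normalization used here, one minus the Levenshtein distance divided by the length of the longer string, is an assumption, as are the function names `levenshtein` and `similarity`.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this definition, a generated test counts as at most 50% similar to an existing one when at least half of the longer text would have to be edited to turn one into the other.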
- Conference Article
27
- 10.1145/3643795.3648396
- Apr 20, 2024
Generating unit tests is a crucial task in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research aims to experimentally investigate the effectiveness of LLMs, specifically exemplified by ChatGPT, for generating unit test scripts for Python programs, and how the generated test cases compare with those generated by an existing unit test generator (Pynguin). For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code. The generated test cases are evaluated based on criteria such as coverage, correctness, and readability. Our results show that ChatGPT's performance is comparable with Pynguin in terms of coverage, though for some cases its performance is superior to Pynguin. We also find that about a third of assertions generated by ChatGPT for some categories were incorrect. Our results also show that there is minimal overlap in missed statements between ChatGPT and Pynguin, suggesting that a combination of both tools may enhance unit test generation performance. Finally, in our experiments, prompt engineering improved ChatGPT's performance, achieving a much higher coverage.
- Research Article
7
- 10.15388/lmitt.2024.20
- May 13, 2024
- Vilnius University Open Series
Unit testing is a fundamental aspect of software development, ensuring the correctness and robustness of code implementations. Traditionally, unit tests are manually crafted by developers based on their understanding of the code and its requirements. However, this process can be time-consuming and error-prone, and may overlook certain edge cases. In recent years, there has been growing interest in leveraging large language models (LLMs) for automating the generation of unit tests. LLMs, such as GPT (Generative Pre-trained Transformer), CodeT5, StarCoder, and LLaMA, have demonstrated remarkable capabilities in natural language understanding and code generation tasks. By using LLMs, researchers aim to develop techniques that automatically generate unit tests from code snippets or specifications, thus optimizing the software testing process. This paper presents a literature review of articles that use LLMs for unit test generation tasks. It also discusses the history of the most commonly used large language models and their parameters, including the first time they were used for code generation tasks. The study surveys the large language models used for code and unit test generation tasks and their increasing popularity in the code generation domain, indicating great promise for the future of unit test generation using LLMs.
- Conference Article
36
- 10.1109/icse43902.2021.00138
- May 1, 2021
Automatic unit test generation that explores the input space and produces effective test cases for given programs has been studied for decades. Many unit test generation tools that can help generate unit test cases with high structural coverage over a program have been examined. However, the fact that existing test generation tools are mainly evaluated on general software programs calls into question their practical effectiveness and usefulness for machine learning libraries, which are statistically oriented and fundamentally different in nature and construction from general software projects. In this paper, we set out to investigate the effectiveness of existing unit test generation techniques on machine learning libraries. To investigate this issue, we conducted an empirical study on five widely used machine learning libraries with two popular unit test case generation tools, i.e., EVOSUITE and Randoop. We find that (1) most of the machine learning libraries do not maintain a high-quality unit test suite regarding commonly applied quality metrics such as code coverage (34.1% on average) and mutation score (21.3% on average), (2) unit test case generation tools, i.e., EVOSUITE and Randoop, lead to clear improvements in code coverage and mutation score, although the improvement is limited, and (3) there exist common patterns in the uncovered code across the five machine learning libraries that can be used to improve unit test case generation tasks.
- Conference Article
49
- 10.1145/3624032.3624035
- Sep 25, 2023
Context: Software testing ensures software quality, but developers often disregard it. Automated test generation is pursued to reduce the consequences of overlooked test cases in a software project. Problem: In the context of Java programs, several tools can completely automate generating unit test sets. Additionally, studies are conducted to offer evidence regarding the quality of the generated test sets. However, it is worth noting that these tools rely on machine learning and other AI algorithms rather than incorporating the latest advancements in Large Language Models (LLMs). Solution: This work aims to evaluate the quality of Java unit tests generated by an OpenAI LLM algorithm, using metrics like code coverage and mutation test score. Method: For this study, 33 programs used by other researchers in the field of automated test generation were selected. This approach was employed to establish a baseline for comparison purposes. For each program, 33 unit test sets were generated automatically, without human interference, by changing OpenAI API parameters. After executing each test set, metrics such as code line coverage, mutation score, and success rate of test execution were collected to evaluate the efficiency and effectiveness of each set. Summary of Results: Our findings revealed that the OpenAI LLM test set demonstrated similar performance across all evaluated aspects compared to traditional automated Java test generation tools used in the previous research. These results are particularly remarkable considering the simplicity of the experiment and the fact that the generated test code did not undergo human analysis.
- Conference Article
3
- 10.5753/sbes.2024.3561
- Sep 30, 2024
Writing unit tests is a time-consuming and labor-intensive development practice. Consequently, various techniques for automatically generating unit tests have been studied. Among them, the use of Large Language Models (LLMs) has recently emerged as a popular approach for automatically generating tests from natural language descriptions. Although many recent studies are dedicated to measuring the ability of LLMs to write valid unit tests, few evaluate the quality of these generated tests. In this context, this study aims to measure the quality of the test code generated by GitHub Copilot in Python by detecting test smells in the generated test cases. To do this, we used approaches to generating unit tests by LLMs that have already been applied in the literature and collected a sample of 194 unit test cases in 30 Python test files. We then measured them using tools specialized in detecting test smells in Python. Finally, we conducted an evaluation of these test cases with software developers and software quality assurance professionals. Our results indicated that 47.4% of the tests generated by Copilot had at least one test smell detected, with a lack of documentation in the assertions being the most common quality problem. These findings indicate that although GitHub Copilot can generate valid unit tests, quality violations are still frequently found in this code.
- Research Article
- 10.15388/mitt.2025.32
- May 12, 2025
- Vilnius University Open Series
Unit testing is critical in software quality assurance, and large language models (LLMs) offer an approach to automate this process. This paper evaluates the quality of unit tests generated by large language models using structured output prompts. The study used six LLMs to generate unit tests for C# focal methods across different classes of cyclomatic complexity. The experimental results show that requiring LLMs to produce output in a strict structure (the Arrange-Act-Assert pattern) significantly influences the quality of the generated unit tests.
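For readers unfamiliar with the Arrange-Act-Assert pattern named above, the following is a minimal sketch of what such a structured test looks like. The study targets C#; Python's standard `unittest` module is used here only for illustration, and `apply_discount` is a hypothetical focal method invented for this example.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical focal method used only for illustration."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_applies_percentage(self):
        # Arrange: set up the inputs
        price, percent = 200.0, 25
        # Act: call the focal method once
        result = apply_discount(price, percent)
        # Assert: check the observable outcome
        self.assertEqual(result, 150.0)

    def test_rejects_out_of_range_percent(self):
        # Arrange
        price, percent = 200.0, 120
        # Act and Assert: the call itself must raise
        with self.assertRaises(ValueError):
            apply_discount(price, percent)
```

A structured output prompt would ask the model to fill in exactly these three labeled sections for each test.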