On the Evaluation of Large Language Models in Unit Test Generation

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation. Existing research primarily focuses on closed-source LLMs (e.g., ChatGPT and Codex) with fixed prompting strategies, leaving the capabilities of advanced open-source LLMs with various prompting settings unexplored. In particular, open-source LLMs offer advantages in data privacy protection and have demonstrated superior performance on some tasks. Moreover, effective prompting is crucial for maximizing LLMs' capabilities. In this paper, we conduct the first empirical study to fill this gap, based on 17 Java projects, five widely-used open-source LLMs with different structures and parameter sizes, and comprehensive evaluation metrics. Our findings highlight the significant …

* Both authors contributed equally to this research.
† This paper was completed during an internship at Huawei Cloud Computing Co.
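Since the study centers on how prompting settings affect open-source LLMs, a minimal sketch of what a zero-shot test-generation prompt might look like is given below; the wording and the buildPrompt helper are illustrative assumptions, not the paper's actual prompt templates.

```java
// Hypothetical sketch of a zero-shot prompt for LLM-based unit test generation;
// the prompt wording and the buildPrompt helper are illustrative, not taken from the paper.
public final class TestGenPrompt {

    /** Assembles a simple prompt asking a model to write a JUnit test for one focal method. */
    public static String buildPrompt(String className, String focalMethodSource) {
        return "You are a Java developer. Write a JUnit 5 test class for the method below.\n"
             + "Class under test: " + className + "\n"
             + "Focal method:\n"
             + focalMethodSource + "\n"
             + "Return only compilable Java test code.";
    }

    public static void main(String[] args) {
        String focal = "public static int add(int a, int b) { return a + b; }";
        System.out.println(buildPrompt("Calculator", focal));
    }
}
```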

Similar Papers
  • Research Article
  • 10.1145/3765758
Reference-Based Retrieval-Augmented Unit Test Generation
  • Dec 3, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Zhe Zhang + 5 more

Automated unit test generation has been widely studied, with Large Language Models (LLMs) recently showing significant potential. LLMs like GPT-4, trained on vast text and code data, excel in various code-related tasks, including unit test generation. However, existing LLM-based approaches often focus solely on the context within the code itself, such as referenced variables, while neglecting broader task-specific contexts, such as the utility of referring to existing tests of relevant methods in unit test generation. Moreover, in the context of unit test generation, these tools prioritize high code coverage, often at the expense of practical usability, correctness, and maintainability. In response, we propose Reference-Based Retrieval Augmentation, a novel mechanism that extends LLM-based Retrieval-Augmented Generation (RAG) to retrieve relevant information by considering task-specific context. In the unit test generation task, for a given focal method, the reference relationships are defined as the reusability or referentiality of tests between the focal method and other methods. To generate high-quality unit tests for the focal method, the test reference relationships are then used to retrieve relevant methods and their existing unit tests. Specifically, we account for the unique structure of unit tests by dividing the test generation process into Given, When, and Then phases. When generating unit tests for a focal method, we retrieve pre-existing tests of other relevant methods, which can provide valuable insights for any of the Given, When, and Then phases. We implement this approach in a tool called RefTest, which sequentially performs preprocessing, test reference retrieval, and unit test generation, using an incremental strategy in which newly generated tests guide the creation of subsequent ones. We evaluated RefTest on 12 open-source projects with 1515 methods, and the results demonstrate that RefTest consistently outperforms existing tools in terms of correctness, completeness, and maintainability of the generated tests.
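To make the Given/When/Then decomposition concrete, here is a hedged JUnit 5 sketch of a test laid out in those three phases; the ShoppingCart focal class and values are hypothetical and not drawn from the paper.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Illustrative Given/When/Then-structured JUnit 5 test; ShoppingCart is a
// hypothetical class under test, included here only to keep the example self-contained.
class ShoppingCartTest {

    /** Minimal stand-in for the focal class. */
    static class ShoppingCart {
        private double total;
        void addItem(String name, int quantity, double unitPrice) { total += quantity * unitPrice; }
        double total() { return total; }
    }

    @Test
    void totalReflectsAddedItems() {
        // Given: the object under test in a known state
        ShoppingCart cart = new ShoppingCart();
        cart.addItem("book", 2, 10.0);

        // When: the focal method is exercised
        double total = cart.total();

        // Then: the observable outcome is asserted
        assertEquals(20.0, total, 1e-9);
    }
}
```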

  • Research Article
  • Cited by 2
  • 10.1145/3728970
STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Jinwei Liu + 5 more

Unit testing plays a crucial role in bug detection and ensuring software correctness. It helps developers identify errors early in development, thereby reducing software defects. In recent years, large language models (LLMs) have demonstrated significant potential in automating unit test generation. However, using LLMs to generate unit tests faces many challenges. 1) The execution pass rate of the test cases generated by LLMs is low. 2) The test case coverage is inadequate, making it challenging to detect potential risks in the code. 3) Current research methods primarily focus on languages such as Java and Python, while studies on C programming are scarce, despite its importance in the real world. To address these challenges, we propose STRUT, a novel unit test generation method. STRUT utilizes structured test cases as a bridge between complex programming languages and LLMs. Instead of directly generating test code, STRUT guides LLMs to produce structured test cases, thereby alleviating the limitations of LLMs when generating code for programming languages with complex features. First, STRUT analyzes the context of focal methods and constructs structured seed test cases for them. These seed test cases then guide LLMs to generate a set of structured test cases. Subsequently, a rule-based approach is employed to convert the structured set of test cases into executable test code. We conducted a comprehensive evaluation of STRUT, which achieved an impressive execution pass rate of 96.01%, along with 77.67% line coverage and 63.60% branch coverage. This performance significantly surpasses that of the LLM-based baseline methods and the symbolic execution tool SunwiseAUnit. These results highlight STRUT's superior capability in generating high-quality unit test cases by leveraging the strengths of LLMs while addressing their inherent limitations.
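As a rough analogue of the structured-test-case idea (shown here in Java rather than the paper's C setting), the sketch below represents one structured case as data and converts it to test code with a trivial rule; the StructuredCase record and emitTest rule are assumptions for illustration, not STRUT's actual format or conversion rules.

```java
import java.util.List;

// Hedged analogue of a "structured test case" in Java: the StructuredCase record
// and the emitTest converter are hypothetical, not STRUT's actual representation or rules.
public final class StructuredCaseDemo {

    /** A structured test case: focal call arguments plus the expected return value. */
    record StructuredCase(String focalMethod, List<String> arguments, String expected) {}

    /** Rule-based conversion of one structured case into JUnit-style test source. */
    static String emitTest(StructuredCase c, int index) {
        return "@Test void case" + index + "() {\n"
             + "    assertEquals(" + c.expected() + ", "
             + c.focalMethod() + "(" + String.join(", ", c.arguments()) + "));\n"
             + "}\n";
    }

    public static void main(String[] args) {
        StructuredCase c = new StructuredCase("Math.max", List.of("3", "7"), "7");
        System.out.println(emitTest(c, 1));
    }
}
```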

  • Research Article
  • Cited by 2
  • 10.1145/3715778
Less Is More: On the Importance of Data Quality for Unit Test Generation
  • Jun 19, 2025
  • Proceedings of the ACM on Software Engineering
  • Junwei Zhang + 5 more

Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of these datasets contain noise, respectively, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves their performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests.
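As one illustration of what a rule-based syntax filter could check, the sketch below keeps a generated test only if it has balanced braces and at least one assertion call; this heuristic is an assumption made for illustration and is far simpler than CleanTest's actual filters.

```java
// A crude, hypothetical stand-in for a rule-based syntax filter: it keeps a test
// only if the text is non-empty, has balanced braces, and contains at least one
// assertion call. CleanTest's real rules are richer than this.
public final class SyntaxFilterSketch {

    static boolean keep(String testSource) {
        if (testSource == null || testSource.isBlank()) return false;
        long open = testSource.chars().filter(ch -> ch == '{').count();
        long close = testSource.chars().filter(ch -> ch == '}').count();
        boolean hasAssertion = testSource.contains("assert");
        return open == close && open > 0 && hasAssertion;
    }

    public static void main(String[] args) {
        System.out.println(keep("@Test void t() { assertTrue(true); }")); // true
        System.out.println(keep("@Test void t() { new Foo();"));          // false: unbalanced, no assertion
    }
}
```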

  • Research Article
  • 10.15388/mitt.2025.32
Quality Evaluation of Large Language Models Generated Unit Tests: Influence of Structured Output
  • May 12, 2025
  • Vilnius University Open Series
  • Dovydas Marius Zapkus + 1 more

Unit testing is critical in software quality assurance, and large language models (LLMs) offer an approach to automating this process. This paper evaluates the quality of unit tests generated by large language models using structured output prompts. The research applied six LLMs to generate unit tests for C# focal methods across different cyclomatic complexity classes. The experimental results show that prompting the LLMs to follow a strict structured output (the Arrange-Act-Assert pattern) significantly influences the quality of the generated unit tests.
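For readers unfamiliar with the pattern, the Arrange-Act-Assert layout looks roughly like the following sketch, written in Java/JUnit rather than the study's C#; the StringUtilsUnderTest class is a hypothetical focal class added only to keep the example self-contained.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Arrange-Act-Assert layout illustrated in Java (the study itself targets C# focal
// methods); StringUtilsUnderTest is a hypothetical focal class, not from the paper.
class ArrangeActAssertExampleTest {

    static class StringUtilsUnderTest {
        static String reverse(String s) { return new StringBuilder(s).reverse().toString(); }
    }

    @Test
    void reverseReturnsCharactersInOppositeOrder() {
        // Arrange: set up inputs and expected state
        String input = "abc";

        // Act: invoke the behaviour under test
        String result = StringUtilsUnderTest.reverse(input);

        // Assert: verify the observable outcome
        assertEquals("cba", result);
    }
}
```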

  • Conference Article
  • Cited by 8
  • 10.1109/icodse.2015.7437005
Unit test code generator for Lua programming language
  • Nov 1, 2015
  • Junno Tantra Pratama Wibowo + 2 more

Software testing is an important step in the software development lifecycle. One of the main processes that takes a lot of time is developing the test code. We propose an automatic unit test code generator to speed up the process and help avoid repetition. We develop the unit test code generator for the Lua programming language. Lua is a fast, lightweight, embeddable scripting language. It has been used in many industrial applications with a focus on embedded systems and games. Unlike other popular scripting languages such as JavaScript, Python, and Ruby, Lua does not have a unit test generator to help its software testing process. The final product, the Lua unit test generator (LUTG), is integrated into ZeroBrane Studio, one of the most popular Lua IDEs, as a plugin that seamlessly connects the coding and testing process. The code generator can generate unit test code, save test case data in Lua and XML file formats, and generate test data automatically using a search-based technique (a genetic algorithm) to achieve the full branch coverage criterion. Using this generator to test several Lua source code files shows that the developed unit test generator can help the unit testing process. It is expected that the unit test generator can improve the productivity, quality, consistency, and abstraction of the unit testing process.

  • Research Article
  • Cited by 6
  • 10.15388/lmitt.2024.20
Unit Test Generation Using Large Language Models: A Systematic Literature Review
  • May 13, 2024
  • Vilnius University Open Series
  • Dovydas Marius Zapkus + 1 more

Unit testing is a fundamental aspect of software development, ensuring the correctness and robustness of code implementations. Traditionally, unit tests are manually crafted by developers based on their understanding of the code and its requirements. However, this process can be time-consuming, error-prone, and may overlook certain edge cases. In recent years, there has been growing interest in leveraging large language models (LLMs) for automating the generation of unit tests. LLMs, such as GPT (Generative Pre-trained Transformer), CodeT5, StarCoder, and LLaMA, have demonstrated remarkable capabilities in natural language understanding and code generation tasks. By using LLMs, researchers aim to develop techniques that automatically generate unit tests from code snippets or specifications, thus optimizing the software testing process. This paper presents a literature review of articles that use LLMs for unit test generation tasks. It also discusses the history of the most commonly used large language models and their parameters, including when they were first used for code generation tasks. The results of this study present the large language models used for code and unit test generation tasks and their increasing popularity in the code generation domain, indicating great promise for the future of unit test generation using LLMs.

  • Research Article
  • Cited by 3
  • 10.1145/3763791
CITYWALK: Enhancing LLM-Based C++ Unit Test Generation via Project-Dependency Awareness and Language-Specific Knowledge
  • Aug 26, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Yuwei Zhang + 8 more

Unit testing plays a pivotal role in the software development lifecycle, as it ensures code quality. However, writing high-quality unit tests remains a time-consuming task for developers in practice. More recently, the application of large language models (LLMs) in automated unit test generation has demonstrated promising results. Existing approaches primarily focus on interpreted programming languages (e.g., Java), while mature solutions tailored to compiled programming languages like C++ are yet to be explored. The intricate language features of C++, such as pointers, templates, and virtual functions, pose particular challenges for LLMs in generating both executable and high-coverage unit tests. To tackle the aforementioned problems, this paper introduces CITYWALK, a novel LLM-based framework for C++ unit test generation. CITYWALK enhances LLMs by providing a comprehensive understanding of the dependency relationships within the project under test via program analysis. Furthermore, CITYWALK incorporates language-specific knowledge about C++ derived from project documentation and empirical observations, significantly improving the correctness of the LLM-generated unit tests. We implement CITYWALK by employing the widely popular LLM GPT-4o. The experimental results show that CITYWALK outperforms current state-of-the-art approaches on a collection of ten popular C++ projects. Our findings demonstrate the effectiveness of CITYWALK in generating high-quality C++ unit tests.

  • Conference Article
  • Cited by 2
  • 10.1145/3593434.3593443
NxtUnit: Automated Unit Test Generation for Go
  • Jun 14, 2023
  • Siwei Wang + 5 more

Automated test generation has been extensively studied for dynamically compiled or typed programming languages like Java and Python. However, Go, a popular statically compiled and typed programming language for server application development, has received limited support from existing tools. To address this gap, we present NxtUnit, an automatic unit test generation tool for Go that uses random testing and is well-suited for microservice architecture. NxtUnit employs a random approach to generate unit tests quickly, making it ideal for smoke testing and providing quick quality feedback. It comes with three types of interfaces: an integrated development environment (IDE) plugin, a command-line interface (CLI), and a browser-based platform. The plugin and CLI tool allow engineers to write unit tests more efficiently, while the platform provides unit test visualization and asynchronous unit test generation. We evaluated NxtUnit by generating unit tests for 13 open-source repositories and 500 ByteDance in-house repositories, resulting in a code coverage of 20.74% for in-house repositories. We conducted a survey among ByteDance engineers and found that NxtUnit can save them 48% of the time spent on writing unit tests. We have made the CLI tool available at https://github.com/bytedance/nxt_unit.
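The random-testing idea behind such a tool can be illustrated with a toy loop that feeds random inputs to one focal method and records each observed outcome as an assertion; the sketch is in Java rather than Go, and the clampPercent method and sampling policy are illustrative assumptions, not NxtUnit's actual strategy.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Random;

// Toy illustration of random test generation (written in Java, whereas NxtUnit targets Go):
// it invokes one focal method with random inputs and turns each observed outcome into an
// assertion. The focal method and the generation policy are hypothetical simplifications.
public final class RandomTestSketch {

    static int clampPercent(int value) {
        if (value < 0 || value > 100) throw new IllegalArgumentException("out of range");
        return value;
    }

    public static void main(String[] args) throws Exception {
        Method focal = RandomTestSketch.class.getDeclaredMethod("clampPercent", int.class);
        Random rng = new Random(42);
        for (int i = 0; i < 5; i++) {
            int input = rng.nextInt(200) - 50;   // random input, sometimes outside the valid range
            try {
                Object out = focal.invoke(null, input);
                System.out.println("assertEquals(" + out + ", clampPercent(" + input + "));");
            } catch (InvocationTargetException e) {
                String thrown = e.getCause().getClass().getSimpleName();
                System.out.println("assertThrows(" + thrown + ".class, () -> clampPercent(" + input + "));");
            }
        }
    }
}
```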

  • Conference Article
  • 10.5753/sast.2025.14036
On the Energy Footprint of Using a Small Language Model for Unit Test Generation
  • Sep 22, 2025
  • Rafael S Durelli + 2 more

Context. Manual unit test creation is a cognitively intensive and time-consuming activity, prompting researchers and practitioners to increasingly adopt automated testing tools. Recent advancements in language models have expanded automation possibilities, including unit test generation, yet these models raise substantial sustainability concerns due to their energy consumption compared to conventional, specialized tools. Goal. Our research investigates whether the energy overhead associated with employing a small language model (SLM) for unit test generation is justified compared to a conventional, lightweight testing tool. We compare and analyze the energy consumption incurred during test suite generation, as well as the fault-finding effectiveness of the resulting test suites, for an SLM (Phi-3.1 Mini 128k) and Pynguin, a purpose-built tool for unit test generation. Method. We posed two research questions: (i) What is the difference in energy usage between Phi and Pynguin during the generation of unit test suites for Python programs?; and (ii) To what extent do unit test suites generated by Phi and Pynguin differ in their fault-finding effectiveness? To rigorously address the first research question, we employed Bayesian Data Analysis (BDA). For the second research question, we conducted a complementary empirical analysis using descriptive statistics. Results. Our Bayesian analysis provides robust evidence indicating that Phi consistently consumes significantly more energy than Pynguin during test suite generation. Conclusions. These findings underscore significant sustainability concerns associated with employing even SLMs for routine Software Engineering tasks such as unit test generation. The results challenge the assumption of universal energy efficiency benefits from smaller-scale models and emphasize the necessity for careful energy consumption evaluations in the adoption of automated software testing approaches.

  • Conference Article
  • Cited by 36
  • 10.1109/icse43902.2021.00138
Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We?
  • May 1, 2021
  • Song Wang + 5 more

Automatic unit test generation that explores the input space and produces effective test cases for given programs has been studied for decades. Many unit test generation tools that can help generate unit test cases with high structural coverage over a program have been examined. However, the fact that existing test generation tools are mainly evaluated on general software programs calls into question their practical effectiveness and usefulness for machine learning libraries, which are statistically oriented and have a fundamentally different nature and construction from general software projects. In this paper, we set out to investigate the effectiveness of existing unit test generation techniques on machine learning libraries. To investigate this issue, we conducted an empirical study on five widely used machine learning libraries with two popular unit test case generation tools, i.e., EVOSUITE and Randoop. We find that (1) most of the machine learning libraries do not maintain a high-quality unit test suite with respect to commonly applied quality metrics such as code coverage (34.1% on average) and mutation score (21.3% on average), (2) unit test case generation tools, i.e., EVOSUITE and Randoop, lead to clear improvements in code coverage and mutation score; however, the improvement is limited, and (3) there exist common patterns in the uncovered code across the five machine learning libraries that can be used to improve unit test case generation tasks.
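For reference, the two quality metrics reported above are conventionally defined as follows (these are the textbook forms, not formulas stated in the abstract; some mutation tools additionally exclude equivalent or non-compiling mutants from the denominator):

```latex
\text{line coverage} = \frac{\#\,\text{executed lines}}{\#\,\text{executable lines}},
\qquad
\text{mutation score} = \frac{\#\,\text{killed mutants}}{\#\,\text{generated mutants}}
```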

  • Research Article
  • Cited by 1
  • 10.21609/jiki.v17i1.1198
Implementation Genetic Algorithm for Optimization of Kotlin Software Unit Test Case Generator
  • Feb 25, 2024
  • Jurnal Ilmu Komputer dan Informasi
  • Mohammad Andiez Satria Permana + 2 more

Unit testing has a significant role in software development, and its impact depends on the quality of the test cases and test data used. To reduce time and effort, unit test generator systems can help automatically generate test cases and test data. However, there is currently no unit test generator for the Kotlin programming language, even though the language is widely used for Android application development. In this study, we propose and develop a test generator system that utilizes a genetic algorithm (GA) and the ANTLR4 parser. The GA is used to obtain optimal test cases and test data for given Kotlin code. The ANTLR4 parser is used to guide the mutation process in the GA so that it is not totally random. Our results show that the generated unit tests achieve an average instruction coverage of 95.64%, a branch coverage of 76.19%, and a line coverage of 96.87%. In addition, only two out of eight generated classes produced duplicate test cases, with a maximum of one duplication in each class. Therefore, it can be concluded that our GA-based optimization of the unit test generator is able to produce unit tests with high code coverage and low duplication.
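A minimal sketch of the genetic-algorithm ingredients involved is shown below, reduced to a (1+1)-style loop with a branch-coverage fitness function and a random mutation operator; the classify focal method, branch identifiers, and operators are hypothetical, and the paper's ANTLR4-guided mutation is more targeted than this.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hedged sketch of search-based test generation: a branch-coverage fitness function
// and a simple mutation operator inside a (1+1)-style evolutionary loop. The focal
// method and operators are hypothetical simplifications of the paper's approach.
public final class GaFitnessSketch {

    /** Hypothetical focal method whose branches we want the test inputs to cover. */
    static String classify(int x, Set<String> coveredBranches) {
        if (x < 0) { coveredBranches.add("b1"); return "negative"; }
        if (x == 0) { coveredBranches.add("b2"); return "zero"; }
        coveredBranches.add("b3"); return "positive";
    }

    /** Fitness of a candidate test suite = number of distinct branches it covers. */
    static int fitness(int[] inputs) {
        Set<String> covered = new TreeSet<>();
        for (int x : inputs) classify(x, covered);
        return covered.size();
    }

    /** Mutation operator: perturb one randomly chosen input. */
    static int[] mutate(int[] inputs, Random rng) {
        int[] child = Arrays.copyOf(inputs, inputs.length);
        child[rng.nextInt(child.length)] += rng.nextInt(21) - 10;
        return child;
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        int[] suite = {5, 6, 7};                    // initial candidate covers only one branch
        for (int gen = 0; gen < 200; gen++) {       // keep a mutant only if it is at least as fit
            int[] candidate = mutate(suite, rng);
            if (fitness(candidate) >= fitness(suite)) suite = candidate;
        }
        System.out.println("covered branches: " + fitness(suite) + " with inputs " + Arrays.toString(suite));
    }
}
```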

  • Conference Article
  • Cited by 3
  • 10.5753/sbes.2024.3561
Detecting Test Smells in Python Test Code Generated by LLM: An Empirical Study with GitHub Copilot
  • Sep 30, 2024
  • Victor Anthony Alves + 3 more

Writing unit tests is a time-consuming and labor-intensive development practice. Consequently, various techniques for automatically generating unit tests have been studied. Among them, the use of Large Language Models (LLMs) has recently emerged as a popular approach for automatically generating tests from natural language descriptions. Although many recent studies are dedicated to measuring the ability of LLMs to write valid unit tests, few evaluate the quality of these generated tests. In this context, this study aims to measure the quality of the test code generated by GitHub Copilot for Python by detecting test smells in the generated test cases. To do this, we used LLM-based unit test generation approaches already applied in the literature and collected a sample of 194 unit test cases across 30 Python test files. We then measured them using tools specialized in detecting test smells in Python. Finally, we conducted an evaluation of these test cases with software developers and software quality assurance professionals. Our results indicated that 47.4% of the tests generated by Copilot had at least one test smell detected, with a lack of documentation in the assertions being the most common quality problem. These findings indicate that although GitHub Copilot can generate valid unit tests, quality violations are still frequently found in this code.
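As a toy illustration of detecting the most common smell reported above (assertions lacking an explanatory message), the sketch below scans test source line by line and flags assertion lines that carry no string literal; real detectors such as those used in the study parse the code rather than matching lines, and this heuristic is only an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Very crude, hypothetical heuristic for "lack of documentation in the assertions":
// flag assertion lines that carry no string message. Real test-smell detectors parse
// the code instead of scanning raw lines.
public final class AssertionMessageCheck {

    static List<Integer> undocumentedAssertionLines(String testSource) {
        List<Integer> flagged = new ArrayList<>();
        String[] lines = testSource.split("\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.startsWith("assert") && !line.contains("\"")) flagged.add(i + 1);
        }
        return flagged;
    }

    public static void main(String[] args) {
        String test = "def test_add():\n"
                    + "    assert add(1, 2) == 3\n"
                    + "    assert add(0, 0) == 0, \"zero plus zero should be zero\"\n";
        System.out.println("undocumented assertions on lines: " + undocumentedAssertionLines(test));
    }
}
```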

  • Research Article
  • 10.3390/e28010074
IGTG&R: An Intent Analysis-Guided Unit Test Generation and Refinement Framework
  • Jan 9, 2026
  • Entropy
  • Xiaojian Liu + 1 more

Code coverage-guided unit test generation (CGTG) and large language model-based test generation (LLMTG) are two principal approaches for the generation of unit tests. Each of these approaches has its inherent advantages and drawbacks. Tests generated by CGTG have been shown to exhibit high code coverage and high executability. However, they lack the capacity to comprehend code intent, which results in an inability to identify deviations between code implementation and design intent (i.e., functional defects). Conversely, although LLMTG demonstrates an advantage in terms of code intent analysis, it is generally characterized by low executability and necessitates iterative debugging. In order to enhance the ability of unit test generation to identify functional defects, a novel framework has been proposed, entitled the intent analysis-guided unit test generation and refinement (IGTG&R) model. The IGTG&R model consists of a two-stage process for test generation. In the first stage, we introduce coverage path entropy to enhance CGTG so as to achieve high executability and code coverage of the test cases. The second stage refines the test cases using LLMs to identify functional defects. We quantify and verify the interference of incorrect code implementation on intent analysis through conditional entropy. In order to reduce this interference, the focal method body is excluded from the code context information during intent analysis. Using this two-stage process, IGTG&R achieves a more profound comprehension of the intent of the code and better identification of functional defects. The IGTG&R model has been demonstrated to achieve an identification rate of functional defects ranging from 65% to 89%, with an execution success rate of 100% and a code coverage rate of 75.8%. This indicates that IGTG&R is superior to the CGTG and LLMTG approaches in multiple aspects.
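The abstract does not give the paper's exact formulations of coverage path entropy and conditional entropy; the standard information-theoretic definitions they presumably build on are:

```latex
% Textbook definitions; the paper's exact formulations are not given in the abstract.
H(P) = -\sum_{p \in P} \Pr(p)\,\log_2 \Pr(p)
\qquad \text{(entropy over coverage paths } p\text{)}
\\[4pt]
H(Y \mid X) = -\sum_{x,\,y} \Pr(x, y)\,\log_2 \Pr(y \mid x)
\qquad \text{(conditional entropy of intent } Y \text{ given code } X\text{)}
```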

  • Research Article
  • Cited by 83
  • 10.3390/app12094369
RESTful API Testing Methodologies: Rationale, Challenges, and Solution Directions
  • Apr 26, 2022
  • Applied Sciences
  • Adeel Ehsan + 3 more

Service-oriented architecture has evolved to be the backbone for large-scale integration between different applications and platforms. This concept has led to today's reality of cloud services. Many of the major business platforms are providing their services to end-users and other companies as well. Companies are crafting ways to allow other businesses fast service integration and to get on board quickly in the market. REST (representational state transfer) has emerged as the standard protocol for implementing and consuming these services, which are called RESTful application programming interfaces (APIs). As the internal details of the RESTful APIs are not completely available during consumption, thorough testing has been a major challenge. Any unprecedented change in the APIs can cause a major failure of service operations, which can cause an organization to face both financial and trust losses. Research efforts have been made to alleviate testing challenges by introducing different frameworks and approaches for auto-generating unit tests. However, there is still a lack of an overview of the state-of-the-art in RESTful API testing. As such, the objective of this article is to identify, analyze, and synthesize the studies that have been performed on RESTful API testing methodologies and unit test generation. With this perspective, a systematic literature review (SLR) study was conducted. In total, 16 papers were retrieved and included based on study selection criteria for in-depth analysis. This SLR discusses and categorizes different problems and solutions related to RESTful API testing and unit test generation.

  • Conference Article
  • Cited by 160
  • 10.1109/icse-seip.2017.27
An Industrial Evaluation of Unit Test Generation: Finding Real Faults in a Financial Application
  • May 1, 2017
  • M Moein Almasi + 4 more

Automated unit test generation has been extensively studied in the literature in recent years. Previous studies on open source systems have shown that test generation tools are quite effective at detecting faults, but how effective and applicable are they in an industrial application? In this paper, we investigate this question using a life insurance and pension products calculator engine owned by SEB Life & Pension Holding AB Riga Branch. To study fault-finding effectiveness, we extracted 25 real faults from the version history of this software project, and applied two up-to-date unit test generation tools for Java, EVOSUITE and RANDOOP, which implement search-based and feedback-directed random test generation, respectively. Automatically generated test suites detected up to 56.40% (EVOSUITE) and 38.00% (RANDOOP) of these faults. The analysis of our results demonstrates challenges that need to be addressed in order to improve fault detection in test generation tools. In particular, classification of the undetected faults shows that 97.62% of them depend on either specific primitive values (50.00%) or the construction of complex state configuration of objects (47.62%). To study applicability, we surveyed the developers of the application under test on their experience and opinions about the test generation tools and the generated test cases. This leads to insights on requirements for academic prototypes for successful technology transfer from academic research to industrial practice, such as a need to integrate with popular build tools, and to improve the readability of the generated tests.
