Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We?

Abstract

Automatic unit test generation, which explores the input space and produces effective test cases for given programs, has been studied for decades. Many unit test generation tools that help generate unit test cases with high structural coverage over a program have been examined. However, because existing test generation tools are mainly evaluated on general software programs, their practical effectiveness and usefulness for machine learning libraries are in question: these libraries are statistically oriented and differ fundamentally in nature and construction from general software projects. In this paper, we investigate the effectiveness of existing unit test generation techniques on machine learning libraries. To this end, we conducted an empirical study on five widely used machine learning libraries with two popular unit test case generation tools, EVOSUITE and Randoop. We find that (1) most of the machine learning libraries do not maintain a high-quality unit test suite with respect to commonly applied quality metrics such as code coverage (34.1% on average) and mutation score (21.3% on average); (2) the unit test case generation tools EVOSUITE and Randoop lead to clear improvements in code coverage and mutation score, but the improvement is limited; and (3) there exist common patterns in the uncovered code across the five machine learning libraries that can be used to improve unit test case generation.
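The two quality metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions (line coverage as the share of executable lines run by the suite, mutation score as the share of generated mutants killed by the suite); the abstract does not say exactly how the study computed them, and some tools exclude equivalent mutants from the denominator. The `SuiteReport` type and the counts are hypothetical, not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class SuiteReport:
    """Raw counts for one test suite (hypothetical structure)."""
    covered_lines: int
    total_lines: int
    killed_mutants: int
    total_mutants: int


def line_coverage(report: SuiteReport) -> float:
    """Percentage of executable lines run by at least one test."""
    if report.total_lines == 0:
        return 0.0
    return 100.0 * report.covered_lines / report.total_lines


def mutation_score(report: SuiteReport) -> float:
    """Percentage of mutants that made at least one test fail."""
    if report.total_mutants == 0:
        return 0.0
    return 100.0 * report.killed_mutants / report.total_mutants


# Made-up counts for illustration only.
example = SuiteReport(covered_lines=30, total_lines=120,
                      killed_mutants=12, total_mutants=80)
print(f"coverage={line_coverage(example):.1f}%  "
      f"mutation score={mutation_score(example):.1f}%")
```

A suite can score well on coverage and poorly on mutation score when it executes code without asserting much about the results, which is one reason studies like this one report both.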

Similar Papers
  • Conference Article
  • Citations: 49
  • 10.1145/3624032.3624035
An initial investigation of ChatGPT unit test generation capability
  • Sep 25, 2023
  • Vitor Guilherme + 1 more

Context: Software testing ensures software quality, but developers often disregard it. The use of automated test generation is pursued to reduce the consequences of overlooked test cases in a software project. Problem: In the context of Java programs, several tools can completely automate generating unit test sets. Additionally, studies are conducted to offer evidence regarding the quality of the generated test sets. However, it is worth noting that these tools rely on machine learning and other AI algorithms rather than incorporating the latest advancements in Large Language Models (LLMs). Solution: This work aims to evaluate the quality of Java unit tests generated by an OpenAI LLM algorithm, using metrics like code coverage and mutation test score. Method: For this study, 33 programs used by other researchers in the field of automated test generation were selected. This approach was employed to establish a baseline for comparison purposes. For each program, 33 unit test sets were generated automatically, without human interference, by changing OpenAI API parameters. After executing each test set, metrics such as code line coverage, mutation score, and success rate of test execution were collected to evaluate the efficiency and effectiveness of each set. Summary of Results: Our findings revealed that the OpenAI LLM test set demonstrated similar performance across all evaluated aspects compared to traditional automated Java test generation tools used in the previous research. These results are particularly remarkable considering the simplicity of the experiment and the fact that the generated test code did not undergo human analysis.

  • Conference Article
  • Citations: 20
  • 10.1145/3650212.3680354
Domain Adaptation for Code Model-Based Unit Test Case Generation
  • Sep 11, 2024
  • Jiho Shin + 3 more

Recently, deep learning-based test case generation approaches have been proposed to automate the generation of unit test cases. In this study, we leverage Transformer-based code models to generate unit tests with the help of Domain Adaptation (DA) at a project level. Specifically, we use CodeT5, a relatively small language model trained on source code data, and fine-tune it on the test generation task. Then, we apply domain adaptation to each target project data to learn project-specific knowledge (project-level DA). We use the Methods2test dataset to fine-tune CodeT5 for the test generation task and the Defects4j dataset for project-level domain adaptation and evaluation. We compare our approach with (a) CodeT5 fine-tuned on the test generation without DA, (b) the A3Test tool, and (c) GPT-4 on five projects from the Defects4j dataset. The results show that tests generated using DA can increase the line coverage by 18.62%, 19.88%, and 18.02% and mutation score by 16.45%, 16.01%, and 12.99% compared to the above (a), (b), and (c) baselines, respectively. The overall results show consistent improvements in metrics such as parse rate, compile rate, BLEU, and CodeBLEU. In addition, we show that our approach can be seen as a complementary solution alongside existing search-based test generation tools such as EvoSuite, to increase the overall coverage and mutation scores with an average of 34.42% and 6.8%, for line coverage and mutation score, respectively.

  • Conference Article
  • Citations: 4
  • 10.1109/icicos56336.2022.9930600
Understandable Automatic Generated Unit Tests using Semantic and Format Improvement
  • Sep 28, 2022
  • Novi Setiani + 2 more

Unit testing is an important testing activity, yet also the most laborious one, because the developer must create and execute unit tests for each class that is created. Unit tests can be created manually by the developer or automatically by using code-based test case generation techniques, such as random, search-based, or symbolic execution techniques. Automatically generated test cases reduce the developer effort of writing unit tests for every method and class. However, automatically generated unit test cases are more difficult to understand than manually written ones. After the unit tests are executed, the results must be validated or maintained by the development team, who must understand the content of the unit tests. This applies not only to the development phase but also to the software maintenance phase. Therefore, understandability is an important property for test cases to have. To explore what kinds of test cases are easy for developers to understand, a requirements gathering activity related to understandability improvement in unit test cases was conducted. This research involved developers and experts in software development, exploring their opinions on the semantics, format, and function of automatically generated unit test cases. Based on the expert opinions and developer interviews, the requirement list is mapped to the EvoSuite- and Randoop-generated unit test cases.

  • Research Article
  • Citations: 24
  • 10.1007/s10009-009-0115-4
GenUTest: a unit test and mock aspect generation tool
  • Sep 3, 2009
  • International Journal on Software Tools for Technology Transfer
  • Benny Pasternak + 2 more

Unit testing plays a major role in the software development process. What started as an ad hoc approach is becoming a common practice among developers. It enables the immediate detection of bugs introduced into a unit whenever code changes occur. Hence, unit tests provide a safety net of regression tests and validation tests which encourage developers to refactor existing code with greater confidence. One of the major cornerstones of the agile development approach is unit testing. Agile methods require all software classes to have unit tests that can be executed by an automated unit-testing framework. However, not all software systems have unit tests. When changes to such software are needed, writing unit tests from scratch, which is hard and tedious, might not be cost effective. In this paper we propose a technique which automatically generates unit tests for software that does not have such tests. We have implemented GenUTest, a prototype tool which captures and logs inter-object interactions occurring during the execution of Java programs, using the aspect-oriented language AspectJ. These interactions are used to generate JUnit tests. They also serve in generating mock aspects, which are mock object-like entities that enable testing units in isolation. The generated JUnit tests and mock aspects are independent of the tool, and can be used by developers to perform unit tests on the software. The comprehensiveness of the unit tests depends on the software execution. We applied GenUTest to several open source projects such as NanoXML and JODE. We present the results, explain the limitations of the tool, and point out directions for future work to improve the code coverage provided by GenUTest and its scalability.

  • Dissertation
  • 10.31979/etd.kddt-d7ms
Multi-Model Unit Test Generation Framework With Reinforcement Learning
  • Jan 1, 2025
  • Tasman Kuang

Unit test generation is a critical step in the software development lifecycle to ensure code quality and reduce the likelihood of bugs. Manually writing unit tests can be time-consuming and require an experienced developer. However, with the emergence of generative AI, large language models (LLMs) in particular have demonstrated their effectiveness in generating code, which naturally raises the question of whether this capability can be applied to automate unit test generation. One of the newer techniques in this field is using Reinforcement Learning (RL) to train a model to generate quality unit tests. RL is the practice of training an agent to take optimal actions to maximize a reward signal. By treating the LLM as an agent and fine-tuning its parameters through feedback from the reward signal, it offers an adaptive and flexible method for improving LLM performance instead of relying on pre-trained models. This project explores different methodologies to augment a multi-model unit test generation framework, including the use of RL to train its test generation capabilities. Using datasets derived from LeetCode and PyMethods2Test, our tool is evaluated against strong baseline LLMs like Gemini and Claude. The results show that the PPO-trained DeepSeek model consistently outperforms baseline generation, achieving higher test pass rates, fewer syntax errors, and improved coverage and mutation scores across both datasets, demonstrating that our framework presents an effective unit test generation method.

  • Research Article
  • Citations: 5
  • 10.1007/s10664-024-10451-x
Toward granular search-based automatic unit test case generation
  • May 17, 2024
  • Empirical Software Engineering
  • Fabiano Pecorelli + 4 more

Unit testing verifies the presence of faults in individual software components. Previous research has been targeting the automatic generation of unit tests through the adoption of random or search-based algorithms. Despite their effectiveness, these approaches aim at creating tests by solely optimizing metrics like code coverage, without ensuring that the resulting tests have granularities that would allow them to verify both the behavior of individual production methods and the interaction between methods of the class under test. To address this limitation, we propose a two-step systematic approach to the generation of unit tests: we first force search-based algorithms to create tests that cover individual methods of the production code, hence implementing the so-called intra-method tests; then, we relax the constraints to enable the creation of intra-class tests that target the interactions among production code methods. The assessment of our approach is conducted through a mixed-method research design that combines statistical analyses with a user study. The key results report that our approach is able to keep the same level of code and mutation coverage while providing test suites that are more structured, more understandable and aligned to the design principles of unit testing.

  • Research Article
  • Citations: 1
  • 10.21609/jiki.v17i1.1198
Implementation Genetic Algorithm for Optimization of Kotlin Software Unit Test Case Generator
  • Feb 25, 2024
  • Jurnal Ilmu Komputer dan Informasi
  • Mohammad Andiez Satria Permana + 2 more

Unit testing has a significant role in software development, and its impact depends on the quality of the test cases and test data used. To reduce time and effort, unit test generator systems can help automatically generate test cases and test data. However, there is currently no unit test generator for the Kotlin programming language, even though this language is popular for Android application development. In this study, we propose and develop a test generator system that utilizes a genetic algorithm (GA) and the ANTLR4 parser. The GA is used to obtain the most optimal test cases and data for a given Kotlin code. The ANTLR4 parser is used to optimize the mutation process in the GA so that the mutation process is not totally random. Our results showed that the generated unit tests achieve an average instruction coverage of 95.64%, with branch coverage of 76.19% and line coverage of 96.87%. In addition, only two out of eight generated classes produced duplicate test cases, with a maximum of one duplication in each class. Therefore, it can be concluded that our optimization with GA on the unit test generator is able to produce unit tests with high code coverage and low duplication.
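As a rough illustration of how a genetic algorithm can search for test data that raises coverage, here is a minimal generic sketch in Python. It is not the Kotlin generator described in the abstract and does not use ANTLR4: the function under test, the fitness measure (distinct outcomes as a stand-in for covered branches), and all parameters are assumptions made for the example.

```python
import random


def classify(x: int) -> str:
    """Hypothetical function under test with four branches."""
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    if x % 2 == 0:
        return "even"
    return "odd"


def fitness(suite) -> int:
    """Distinct outcomes exercised by the suite (a proxy for branch coverage)."""
    return len({classify(x) for x in suite})


def evolve(generations: int = 30, pop_size: int = 10,
           suite_len: int = 3, seed: int = 1):
    rng = random.Random(seed)
    # Each individual is a small test suite: a list of integer inputs.
    population = [[rng.randint(1, 50) for _ in range(suite_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, suite_len - 1)
            child = a[:cut] + b[cut:]              # one-point crossover
            idx = rng.randrange(suite_len)
            child[idx] += rng.randint(-10, 10)     # mutation: nudge one input
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

The mutation step here is purely random. The abstract's point is that guiding mutation with information from a parser can make this step less blind; the sketch leaves that part out.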

  • Conference Article
  • Citations: 3
  • 10.1145/3593434.3593443
NxtUnit: Automated Unit Test Generation for Go
  • Jun 14, 2023
  • Siwei Wang + 5 more

Automated test generation has been extensively studied for dynamically compiled or typed programming languages like Java and Python. However, Go, a popular statically compiled and typed programming language for server application development, has received limited support from existing tools. To address this gap, we present NxtUnit, an automatic unit test generation tool for Go that uses random testing and is well-suited for microservice architecture. NxtUnit employs a random approach to generate unit tests quickly, making it ideal for smoke testing and providing quick quality feedback. It comes with three types of interfaces: an integrated development environment (IDE) plugin, a command-line interface (CLI), and a browser-based platform. The plugin and CLI tool allow engineers to write unit tests more efficiently, while the platform provides unit test visualization and asynchronous unit test generation. We evaluated NxtUnit by generating unit tests for 13 open-source repositories and 500 ByteDance in-house repositories, resulting in a code coverage of 20.74% for in-house repositories. We conducted a survey among ByteDance engineers and found that NxtUnit can save them 48% of the time on writing unit tests. We have made the CLI tool available at https://github.com/bytedance/nxt_unit.

  • Book Chapter
  • Citations: 7
  • 10.1007/978-3-540-77966-7_20
GenUTest: A Unit Test and Mock Aspect Generation Tool
  • Oct 23, 2007
  • Benny Pasternak + 2 more

Unit testing plays a major role in the software development process. It enables the immediate detection of bugs introduced into a unit whenever code changes occur. Hence, unit tests provide a safety net of regression tests and validation tests which encourage developers to refactor existing code. Nevertheless, not all software systems contain unit tests. When changes to such software are needed, writing unit tests from scratch might not be cost effective. In this paper we propose a technique which automatically generates unit tests for software that does not have such tests. We have implemented GenUTest, a tool which captures and logs inter-object interactions occurring during the execution of Java programs. These interactions are used to generate JUnit tests. They also serve in generating mock aspects, which are mock object-like entities that assist the testing process. The interactions are captured using the aspect-oriented language AspectJ.

  • Research Article
  • Citations: 1
  • 10.1145/3765758
Reference-Based Retrieval-Augmented Unit Test Generation
  • Dec 3, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Zhe Zhang + 5 more

Automated unit test generation has been widely studied, with Large Language Models (LLMs) recently showing significant potential. LLMs like GPT-4, trained on vast text and code data, excel in various code-related tasks, including unit test generation. However, existing LLM-based approaches often focus solely on the context within the code itself, such as referenced variables, while neglecting broader task-specific contexts, such as the utility of referring to existing tests of relevant methods in unit test generation. Moreover, in the context of unit test generation, these tools prioritize high code coverage, often at the expense of practical usability, correctness, and maintainability. In response, we propose Reference-Based Retrieval Augmentation, a novel mechanism that extends LLM-based Retrieval-Augmented Generation (RAG) to retrieve relevant information by considering task-specific context. In the unit test generation task, for a given focal method, the reference relationships are defined as the reusability or referentiality of tests between the focal method and other methods. To generate high-quality unit tests for the focal method, the test reference relationships are then used to retrieve relevant methods and their existing unit tests. Specifically, we account for the unique structure of unit tests by dividing the test generation process into Given, When, and Then phases. When generating unit tests for a focal method, we retrieve pre-existing tests of other relevant methods, which can provide valuable insights for any of the Given, When, and Then phases. We implement this approach in a tool called RefTest, which sequentially performs preprocessing, test reference retrieval, and unit test generation, using an incremental strategy in which newly generated tests guide the creation of subsequent ones. We evaluated RefTest on 12 open-source projects with 1515 methods, and the results demonstrate that RefTest consistently outperforms existing tools in terms of correctness, completeness, and maintainability of the generated tests.

  • Conference Article
  • Citations: 161
  • 10.1109/icse-seip.2017.27
An Industrial Evaluation of Unit Test Generation: Finding Real Faults in a Financial Application
  • May 1, 2017
  • M Moein Almasi + 4 more

Automated unit test generation has been extensively studied in the literature in recent years. Previous studies on open source systems have shown that test generation tools are quite effective at detecting faults, but how effective and applicable are they in an industrial application? In this paper, we investigate this question using a life insurance and pension products calculator engine owned by SEB Life & Pension Holding AB Riga Branch. To study fault-finding effectiveness, we extracted 25 real faults from the version history of this software project, and applied two up-to-date unit test generation tools for Java, EVOSUITE and RANDOOP, which implement search-based and feedback-directed random test generation, respectively. Automatically generated test suites detected up to 56.40% (EVOSUITE) and 38.00% (RANDOOP) of these faults. The analysis of our results demonstrates challenges that need to be addressed in order to improve fault detection in test generation tools. In particular, classification of the undetected faults shows that 97.62% of them depend on either specific primitive values (50.00%) or the construction of complex state configuration of objects (47.62%). To study applicability, we surveyed the developers of the application under test on their experience and opinions about the test generation tools and the generated test cases. This leads to insights on requirements for academic prototypes for successful technology transfer from academic research to industrial practice, such as a need to integrate with popular build tools, and to improve the readability of the generated tests.
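The general idea behind feedback-directed random test generation, the family of techniques this abstract attributes to RANDOOP, can be sketched in a few lines. This is a heavily simplified Python illustration and not RANDOOP's actual algorithm: the `Counter` class, the argument ranges, and the redundancy check (comparing final object state) are assumptions made for the example.

```python
import random


class Counter:
    """Hypothetical class under test."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, n: int) -> None:
        if n < 0:
            raise ValueError("n must be non-negative")
        self.value += n

    def reset(self) -> None:
        self.value = 0


def run(sequence):
    """Replay a call sequence on a fresh object; None means it raised."""
    obj = Counter()
    try:
        for name, args in sequence:
            getattr(obj, name)(*args)
    except Exception:
        return None
    return obj.value


def generate(budget: int, seed: int = 0):
    rng = random.Random(seed)
    pool = [[]]          # error-free sequences found so far (starts empty)
    seen_states = {0}    # final states already reached by some sequence
    for _ in range(budget):
        base = rng.choice(pool)
        if rng.random() < 0.5:
            call = ("add", (rng.randint(-3, 10),))
        else:
            call = ("reset", ())
        candidate = base + [call]
        state = run(candidate)
        # Feedback: keep only sequences that run cleanly and reach a new
        # state, so later extensions build on useful prefixes.
        if state is not None and state not in seen_states:
            seen_states.add(state)
            pool.append(candidate)
    return pool


sequences = generate(budget=200)
print(len(sequences), "sequences kept")
```

A purely random generator would keep extending sequences that already throw or that repeat earlier behavior. Using execution results to prune the pool is what the word "feedback" refers to. The abstract's finding that many undetected faults depend on specific primitive values is visible even here: inputs outside the sampled range are never tried.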

  • Research Article
  • 10.3390/e28010074
IGTG&R: An Intent Analysis-Guided Unit Test Generation and Refinement Framework
  • Jan 9, 2026
  • Entropy
  • Xiaojian Liu + 1 more

Code coverage-guided unit test generation (CGTG) and large language model-based test generation (LLMTG) are two principal approaches for the generation of unit tests. Each of these approaches has its inherent advantages and drawbacks. Tests generated by CGTG have been shown to exhibit high code coverage and high executability. However, they lack the capacity to comprehend code intent, which results in an inability to identify deviations between code implementation and design intent (i.e., functional defects). Conversely, although LLMTG demonstrates an advantage in terms of code intent analysis, it is generally characterized by low executability and necessitates iterative debugging. In order to enhance the ability of unit test generation to identify functional defects, a novel framework has been proposed, entitled the intent analysis-guided unit test generation and refinement (IGTG&R) model. The IGTG&R model consists of a two-stage process for test generation. In the first stage, we introduce coverage path entropy to enhance CGTG to achieve high executability and code coverage of test cases. The second stage refines the test cases using LLMs to identify functional defects. We quantify and verify the interference of incorrect code implementation on intent analysis through conditional entropy. In order to reduce this interference, the focal method body is excluded from the code context information during intent analysis. Using this two-stage process, IGTG&R achieves a more profound comprehension of the intent of the code and the identification of functional defects. The IGTG&R model has been demonstrated to achieve an identification rate of functional defects ranging from 65% to 89%, with an execution success rate of 100% and a code coverage rate of 75.8%. This indicates that IGTG&R is superior to the CGTG and LLMTG approaches in multiple aspects.

  • Research Article
  • Citations: 75
  • 10.1007/s10851-006-8530-6
Tool-assisted unit-test generation and selection based on operational abstractions
  • Jul 1, 2006
  • Automated Software Engineering
  • Tao Xie + 1 more

Unit testing, a common step in software development, presents a challenge. When produced manually, unit test suites are often insufficient to identify defects. The main alternative is to use one of a variety of automatic unit-test generation tools: these are able to produce and execute a large number of test inputs that extensively exercise the unit under test. However, without a priori specifications, programmers need to manually verify the outputs of these test executions, which is generally impractical. To reduce this cost, unit-test selection techniques may be used to help select a subset of automatically generated test inputs. Then programmers can verify their outputs, equip them with test oracles, and put them into the existing test suite. In this paper, we present the operational violation approach for unit-test generation and selection, a black-box approach without requiring a priori specifications. The approach dynamically generates operational abstractions from executions of the existing unit test suite. These operational abstractions guide test generation tools to generate tests to violate them. The approach selects those generated tests violating operational abstractions for inspection. These selected tests exercise some new behavior that has not been exercised by the existing tests. We implemented this approach by integrating the use of Daikon (a dynamic invariant detection tool) and Parasoft Jtest (a commercial Java unit testing tool), and conducted several experiments to assess the approach.
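The selection step of the operational violation approach can be pictured with a toy sketch: infer simple properties from runs of the existing tests, then keep only those generated inputs whose execution violates them. The Python below is an illustration under strong simplifying assumptions. It uses observed input and output bounds as a stand-in for the much richer invariants a tool like Daikon infers, and the function under test and all inputs are hypothetical.

```python
def absolute(x: int) -> int:
    """Hypothetical unit under test."""
    return x if x >= 0 else -x


def infer_abstraction(observations):
    """Toy operational abstraction: bounds seen on inputs and outputs."""
    inputs = [i for i, _ in observations]
    outputs = [o for _, o in observations]
    return {"in_min": min(inputs), "in_max": max(inputs),
            "out_min": min(outputs), "out_max": max(outputs)}


def violates(abstraction, x, y) -> bool:
    """True if the run (x -> y) falls outside the inferred bounds."""
    return not (abstraction["in_min"] <= x <= abstraction["in_max"]
                and abstraction["out_min"] <= y <= abstraction["out_max"])


# Step 1: run the existing suite and infer abstractions from it.
existing_inputs = [0, 1, 5]
observations = [(x, absolute(x)) for x in existing_inputs]
abstraction = infer_abstraction(observations)

# Step 2: run generated candidates and select those that violate
# an abstraction, since they exercise behavior the suite has not seen.
candidate_inputs = [2, 3, -4, 9]
selected = [x for x in candidate_inputs
            if violates(abstraction, x, absolute(x))]
print(selected)
```

Only the selected inputs are handed to the programmer for manual inspection, which is how the approach reduces the cost of verifying outputs without a priori specifications.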

  • Conference Article
  • 10.5753/sast.2025.14036
On the Energy Footprint of Using a Small Language Model for Unit Test Generation
  • Sep 22, 2025
  • Rafael S Durelli + 2 more

Context. Manual unit test creation is a cognitively intensive and time-consuming activity, prompting researchers and practitioners to increasingly adopt automated testing tools. Recent advancements in language models have expanded automation possibilities, including unit test generation, yet these models raise substantial sustainability concerns due to their energy consumption compared to conventional, specialized tools. Goal. Our research investigates whether the energy overhead associated with employing a small language model (SLM) for unit test generation is justified compared to a conventional, lightweight testing tool. We compare and analyze the energy consumption incurred during test suite generation, as well as the fault-finding effectiveness of the resulting test suites, for an SLM (Phi-3.1 Mini 128k) and Pynguin, a purpose-built tool for unit test generation. Method. We posed two research questions: (i) What is the difference in energy usage between Phi and Pynguin during the generation of unit test suites for Python programs? and (ii) To what extent do unit test suites generated by Phi and Pynguin differ in their fault-finding effectiveness? To rigorously address the first research question, we employed Bayesian Data Analysis (BDA). For the second research question, we conducted a complementary empirical analysis using descriptive statistics. Results. Our Bayesian analysis provides robust evidence indicating that Phi consistently consumes significantly more energy than Pynguin during test suite generation. Conclusions. These findings underscore significant sustainability concerns associated with employing even SLMs for routine Software Engineering tasks such as unit test generation. The results challenge the assumption of universal energy efficiency benefits from smaller-scale models and emphasize the necessity for careful energy consumption evaluations in the adoption of automated software testing approaches.

  • Conference Article
  • Citations: 19
  • 10.1145/1328279.1328285
UnitPlus
  • Oct 21, 2007
  • Yoonki Song + 2 more

In the software development life cycle, unit testing is an important phase that helps in early detection of bugs. A unit test case consists of two parts: a test input, which is often a sequence of method calls, and a test oracle, which is often in the form of assertions. The effectiveness of a unit test case depends on its test input as well as its test oracle because the test oracle helps in exposing bugs during the execution of the test input. The task of writing effective test oracles is not trivial as this task requires domain or application knowledge and also needs knowledge of the intricate details of the class under test. In addition, when developers write new unit test cases, much test code (including code in test inputs or oracles) such as method argument values is the same as some previously written test code. To assist developers in writing test code in unit test cases more efficiently, we have developed an Eclipse plugin for JUnit test cases, called UnitPlus, that runs in the background and recommends test-code pieces for developers to choose (and revise when needed) to put in test oracles or test inputs. The recommendation is based on static analysis of the class under test and already written unit test cases. We have conducted a feasibility study for our UnitPlus plugin with four Java libraries to demonstrate its potential utility.
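The two parts of a unit test case described above, a test input made of method calls and a test oracle made of assertions, can be seen side by side in a small example. The abstract concerns JUnit; the sketch below uses Python's standard `unittest` module as an analogue, and the `BoundedStack` class is hypothetical.

```python
import unittest


class BoundedStack:
    """Tiny hypothetical class under test."""

    def __init__(self, capacity: int) -> None:
        self._items = []
        self._capacity = capacity

    def push(self, value: int) -> None:
        if len(self._items) >= self._capacity:
            raise OverflowError("stack is full")
        self._items.append(value)

    def pop(self) -> int:
        return self._items.pop()

    def size(self) -> int:
        return len(self._items)


class BoundedStackTest(unittest.TestCase):
    def test_push_then_pop(self) -> None:
        # Test input: a sequence of method calls that drives the
        # object into the state of interest.
        stack = BoundedStack(capacity=2)
        stack.push(1)
        stack.push(2)
        popped = stack.pop()

        # Test oracle: assertions stating what correct behavior is.
        self.assertEqual(popped, 2)
        self.assertEqual(stack.size(), 1)

    def test_push_beyond_capacity(self) -> None:
        # Test input
        stack = BoundedStack(capacity=1)
        stack.push(1)
        # Test oracle: the overflowing call must raise.
        with self.assertRaises(OverflowError):
            stack.push(2)
```

Saved as a module, the tests can be run with `python -m unittest`. Note how argument values such as `capacity=2` and the pushed integers recur across tests; that repetition is the kind of test code a recommender could suggest from previously written tests.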
