On the Energy Footprint of Using a Small Language Model for Unit Test Generation

  • Abstract
  • Literature Map
  • Similar Papers
Abstract
Translate article icon Translate Article Star icon

Context. Manual unit test creation is a cognitively intensive and time-consuming activity, prompting researchers and practitioners to increasingly adopt automated testing tools. Recent advancements in language models have expanded automation possibilities, including unit test generation, yet these models raise substantial sustainability concerns due to their energy consumption compared to conventional, specialized tools. Goal. Our research investigates whether the energy overhead associated with employing a small language model (SLM) for unit test generation is justified compared to a conventional, lightweight testing tool. We compare and analyze the energy consumption incurred during test suite generation, as well as the fault-finding effectiveness of the resulting test suites, for an SLM (Phi-3.1 Mini 128k) and Pynguin, a purpose-built tool for unit test generation. Method.We posed two research questions: (i) What is the difference in energy usage between Phi and Pynguin during the generation of unit test suites for Python programs?; and (ii) To what extent do unit test suites generated by Phi and Pynguin differ in their fault-finding effectiveness? To rigorously address the first research question, we employed Bayesian Data Analysis (BDA). For the second research question, we conducted a complementary empirical analysis using descriptive statistics. Results. Our Bayesian analysis provides robust evidence indicating that Phi consistently consumes significantly more energy than Pynguin during test suite generation. Conclusions. These findings underscore significant sustainability concerns associated with employing even SLMs for routine Software Engineering tasks such as unit test generation. The results challenge the assumption of universal energy efficiency benefits from smaller-scale models and emphasize the necessity for careful energy consumption evaluations in the adoption of automated software testing approaches.

Similar Papers
  • Research Article
  • Cite Count Icon 75
  • 10.1007/s10851-006-8530-6
Tool-assisted unit-test generation and selection based on operational abstractions
  • Jul 1, 2006
  • Automated Software Engineering
  • Tao Xie + 1 more

Unit testing, a common step in software development, presents a challenge. When produced manually, unit test suites are often insufficient to identify defects. The main alternative is to use one of a variety of automatic unit-test generation tools: these are able to produce and execute a large number of test inputs that extensively exercise the unit under test. However, without a priori specifications, programmers need to manually verify the outputs of these test executions, which is generally impractical. To reduce this cost, unit-test selection techniques may be used to help select a subset of automatically generated test inputs. Then programmers can verify their outputs, equip them with test oracles, and put them into the existing test suite. In this paper, we present the operational violation approach for unit-test generation and selection, a black-box approach without requiring a priori specifications. The approach dynamically generates operational abstractions from executions of the existing unit test suite. These operational abstractions guide test generation tools to generate tests to violate them. The approach selects those generated tests violating operational abstractions for inspection. These selected tests exercise some new behavior that has not been exercised by the existing tests. We implemented this approach by integrating the use of Daikon (a dynamic invariant detection tool) and Parasoft Jtest (a commercial Java unit testing tool), and conducted several experiments to assess the approach.

  • Conference Article
  • Cite Count Icon 227
  • 10.1109/ase.2015.86
Do Automatically Generated Unit Tests Find Real Faults? An Empirical Study of Effectiveness and Challenges (T)
  • Nov 1, 2015
  • Sina Shamshiri + 5 more

Rather than tediously writing unit tests manually, tools can be used to generate them automatically - sometimes even resulting in higher code coverage than manual testing. But how good are these tests at actually finding faults? To answer this question, we applied three state-of-the-art unit test generation tools for Java (Randoop, EvoSuite, and Agitar) to the 357 real faults in the Defects4J dataset and investigated how well the generated test suites perform at detecting these faults. Although the automatically generated test suites detected 55.7% of the faults overall, only 19.9% of all the individual test suites detected a fault. By studying the effectiveness and problems of the individual tools and the tests they generate, we derive insights to support the development of automated unit test generators that achieve a higher fault detection rate. These insights include 1) improving the obtained code coverage so that faulty statements are executed in the first instance, 2) improving the propagation of faulty program states to an observable output, coupled with the generation of more sensitive assertions, and 3) improving the simulation of the execution environment to detect faults that are dependent on external factors such as date and time.

  • Conference Article
  • Cite Count Icon 40
  • 10.1109/ase.2003.1240293
Tool-assisted unit test selection based on operational violations
  • Jan 23, 2004
  • Tao Xie + 1 more

Unit testing, a common step in software development, presents a challenge. When produced manually, unit test suites are often insufficient to identify defects. The main alternative is to use one of a variety of automatic unit test generation tools: these are able to produce and execute a large number of test inputs that extensively exercise the unit under test. However, without a priori specifications, developers need to manually verify the outputs of these test executions, which is generally impractical. To reduce this cost, unit test selection techniques may be used to help select a subset of automatically generated test inputs. Then developers can verify their outputs, equip them with test oracles, and put them into the existing test suite. In this paper, we present the operational violation approach for unit test selection, a black-box approach without requiring a priori specifications. The approach dynamically generates operational abstractions from executions of the existing unit test suite. Any automatically generated tests violating the operational abstractions are identified as candidates for selection. In addition, these operational abstractions can guide test generation tools to produce better tests. To experiment dynamic approach, we integrated the use of Daikon (a dynamic invariant detection tool) and Jtest (a commercial Java unit testing tool). An experiment is conducted to assess this approach.

  • Research Article
  • Cite Count Icon 13
  • 10.1098/rstb.2017.0381
Towards systematic, data-driven validation of a collaborative, multi-scale model of Caenorhabditis elegans.
  • Sep 10, 2018
  • Philosophical Transactions of the Royal Society B: Biological Sciences
  • Richard C Gerkin + 2 more

The OpenWorm Project is an international open-source collaboration to create a multi-scale model of the organism Caenorhabditis elegans At each scale, including subcellular, cellular, network and behaviour, this project employs one or more computational models that aim to recapitulate the corresponding biological system at that scale. This requires that the simulated behaviour of each model be compared with experimental data both as the model is continuously refined and as new experimental data become available. Here we report the use of SciUnit, a software framework for model validation, to attempt to achieve these goals. During project development, each model is continuously subjected to data-driven 'unit tests' that quantitatively summarize model-data agreement, identifying modelling progress and highlighting particular aspects of each model that fail to adequately reproduce known features of the biological organism and its components. This workflow is publicly visible via both GitHub and a web application and accepts community contributions to ensure that modelling goals are transparent and well-informed.This article is part of a discussion meeting issue 'Connectome to behaviour: modelling C. elegans at cellular resolution'.

  • Conference Article
  • Cite Count Icon 36
  • 10.1109/icse43902.2021.00138
Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We?
  • May 1, 2021
  • Song Wang + 5 more

Automatic unit test generation that explores the input space and produces effective test cases for given programs have been studied for decades. Many unit test generation tools that can help generate unit test cases with high structural coverage over a program have been examined. However, the fact that existing test generation tools are mainly evaluated on general software programs calls into question about its practical effectiveness and usefulness for machine learning libraries, which are statistically orientated and have fundamentally different nature and construction from general software projects. In this paper, we set out to investigate the effectiveness of existing unit test generation techniques on machine learning libraries. To investigate this issue, we conducted an empirical study on five widely used machine learning libraries with two popular unit testcase generation tools, i.e., EVOSUITE and Randoop. We find that (1) most of the machine learning libraries do not maintain a high-quality unit test suite regarding commonly applied quality metrics such as code coverage (on average is 34.1%) and mutation score (on average is 21.3%), (2) unit test case generation tools, i.e., EVOSUITE and Randoop, lead to clear improvements in code coverage and mutation score, however, the improvement is limited, and (3) there exist common patterns in the uncovered code across the five machine learning libraries that can be used to improve unit test case generation tasks.

  • Book Chapter
  • Cite Count Icon 13
  • 10.1007/978-3-319-66299-2_17
Diversity in Search-Based Unit Test Suite Generation
  • Jan 1, 2017
  • Nasser M Albunian

Search-based unit test generation is often based on evolutionary algorithms. Lack of diversity in the population of an evolutionary algorithm may lead to premature convergence at local optima, which would negatively affect the code coverage in test suite generation. While methods to improve population diversity are well-studied in the literature on genetic algorithms (GAs), little attention has been paid to diversity in search-based unit test generation so far. The aim of our research is to study the effects of population diversity on search-based unit test generation by applying different diversity maintenance and control techniques. As a first step towards understanding the influence of population diversity on the test generation, we adapt diversity measurements based on phenotypic and genotypic representation to the search space of unit test suites.

  • Research Article
  • Cite Count Icon 30
  • 10.1002/stvr.1660
Random or evolutionary search for object‐oriented test suite generation?
  • Mar 30, 2018
  • Software Testing, Verification and Reliability
  • Sina Shamshiri + 6 more

SummaryAn important aim in software testing is constructing a test suite with high structural code coverage, that is, ensuring that most if not all of the code under test have been executed by the test cases comprising the test suite. Several search‐based techniques have proved successful at automatically generating tests that achieve high coverage. However, despite the well‐established arguments behind using evolutionary search algorithms (eg, genetic algorithms) in preference to random search, it remains an open question whether the benefits can actually be observed in practice when generating unit test suites for object‐oriented classes. In this paper, we report an empirical study on the effects of using evolutionary algorithms (including a genetic algorithm and chemical reaction optimization) to generate test suites, compared with generating test suites incrementally with random search. We apply the EVOSUITEunit test suite generator to 1000 classes randomly selected from the SF110 corpus of open‐source projects. Surprisingly, the results show that the difference is much smaller than one might expect: While evolutionary search covers more branches of the type where standard fitness functions provide guidance, we observed that, in practice, the vast majority of branches do not provide any guidance to the search. These results suggest that, although evolutionary algorithms are more effective at covering complex branches, a random search may suffice to achieve high coverage of most object‐oriented classes.

  • Research Article
  • Cite Count Icon 1
  • 10.1145/3765758
Reference-Based Retrieval-Augmented Unit Test Generation
  • Dec 3, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Zhe Zhang + 5 more

Automated unit test generation has been widely studied, with Large Language Models (LLMs) recently showing significant potential. LLMs like GPT-4, trained in vast text and code data, excel in various code-related tasks, including unit test generation. However, existing LLM-based approaches often focus solely on the context within the code itself, such as referenced variables, while neglecting broader task-specific contexts, such as the utility of referring to existing tests of relevant methods in unit test generation. Moreover, in the context of unit test generation, these tools prioritize high code coverage, often at the expense of practical usability, correctness, and maintainability. In response, we propose Reference-Based Retrieval Augmentation , a novel mechanism that extends LLM-based Retrieval-Augmented Generation (RAG) to retrieve relevant information by considering task-specific context. In the unit test generation task, for a given focal method, the reference relationships is defined as the reusability or referentiality of tests between the focal method and other methods. To generate high-quality unit tests for the focal method, the test reference relationships are then used to retrieve relevant methods and their existing unit tests. Specifically, we account for the unique structure of unit tests by dividing the test generation process into Given , When , and Then phases. When generating unit tests for a focal method, we retrieve pre-existing tests of other relevant methods, which can provide valuable insights for any of the Given , When , and Then phases. We implement this approach in a tool called RefTest , which sequentially performs preprocessing, test reference retrieval, and unit test generation, using an incremental strategy in which newly generated tests guide the creation of subsequent ones. We evaluated RefTest on 12 open-source projects with 1515 methods, and the results demonstrate that RefTest consistently outperforms existing tools in terms of correctness, completeness, and maintainability of the generated tests.

  • Conference Article
  • Cite Count Icon 60
  • 10.1145/2739480.2754696
Random or Genetic Algorithm Search for Object-Oriented Test Suite Generation?
  • Jul 11, 2015
  • Sina Shamshiri + 3 more

Achieving high structural coverage is an important aim in software testing. Several search-based techniques have proved successful at automatically generating tests that achieve high coverage. However, despite the well- established arguments behind using evolutionary search algorithms (e.g., genetic algorithms) in preference to random search, it remains an open question whether the benefits can actually be observed in practice when generating unit test suites for object-oriented classes. In this paper, we report an empirical study on the effects of using a genetic algorithm (GA) to generate test suites over generating test suites incrementally with random search, by applying the EvoSuite unit test suite generator to 1,000 classes randomly selected from the SF110 corpus of open source projects. Surprisingly, the results show little difference between the coverage achieved by test suites generated with evolutionary search compared to those generated using random search. A detailed analysis reveals that the genetic algorithm covers more branches of the type where standard fitness functions provide guidance. In practice, however, we observed that the vast majority of branches in the analyzed projects provide no such guidance.

  • Conference Article
  • Cite Count Icon 161
  • 10.1109/icse-seip.2017.27
An Industrial Evaluation of Unit Test Generation: Finding Real Faults in a Financial Application
  • May 1, 2017
  • M Moein Almasi + 4 more

Automated unit test generation has been extensively studied in the literature in recent years. Previous studies on open source systems have shown that test generation tools are quite effective at detecting faults, but how effective and applicable are they in an industrial application? In this paper, we investigate this question using a life insurance and pension products calculator engine owned by SEB Life & Pension Holding AB Riga Branch. To study fault-finding effectiveness, we extracted 25 real faults from the version history of this software project, and applied two up-to-date unit test generation tools for Java, EVOSUITE and RANDOOP, which implement search-based and feedback-directed random test generation, respectively. Automatically generated test suites detected up to 56.40% (EVOSUITE) and 38.00% (RANDOOP) of these faults. The analysis of our results demonstrates challenges that need to be addressed in order to improve fault detection in test generation tools. In particular, classification of the undetected faults shows that 97.62% of them depend on either specific primitive values (50.00%) or the construction of complex state configuration of objects (47.62%). To study applicability, we surveyed the developers of the application under test on their experience and opinions about the test generation tools and the generated test cases. This leads to insights on requirements for academic prototypes for successful technology transfer from academic research to industrial practice, such as a need to integrate with popular build tools, and to improve the readability of the generated tests.

  • Research Article
  • 10.1145/3765754
TestLoop: A Process Model Describing Human-in-the-Loop Software Test Suite Generation
  • Sep 4, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Matthew C Davis + 5 more

There is substantial diversity among testing tools used by software engineers. For example, fuzzers may target crashes and security vulnerabilities while Test sUite Generators (TUGs) may create high-coverage test suites. In the research community, test generation tools are primarily evaluated using metrics like bugs identified or code coverage. However, achieving good values for these metrics does not necessarily imply that these tools help software engineers efficiently develop effective test suites. To understand the test suite generation process, we performed a secondary analysis of recordings from a previously-published user study in which 28 professional software engineers used two tools to generate test suites for three programs with each tool. From these 168 recordings ( \(28\ users\times 2\ tools\times 3\ programs/tool\) ), we extracted a process model of test suite generation called TestLoop that builds upon prior work and systematizes a user’s test suite generation process for a single function into 7 steps. We then used TestLoop’s steps to describe 8 prior and 10 new recordings of users generating test suites using the Jest, Hypothesis, and NaNofuzz test generation tools. Our results showed that TestLoop can be used to help answer previously hard-to-answer questions about how users interact with test suite generation tools and to identify ways that tools might be improved.

  • Book Chapter
  • 10.1007/978-3-642-04288-1_8
Effects on Unit Tests – Preliminary Analysis
  • Oct 30, 2009
  • Lech Madeyski

With the rising acceptance of XP, and agile methodologies in general, a growing number of software projects develop and maintain large test suites. Tests are considered a kind of a live documentation for the production code, because tests are always kept in sync with the code as opposed to typical text-based documentation which may not be in sync with the code. However, the first and foremost tests are used to execute a program with the intent of finding errors [191]. Hence, not only the thoroughness of developed tests (which is often taken into consideration by means of code coverage measures) but also the fault detection effectiveness of developed tests play the key part in software development. As explained in Sect. 3.3.2.4, code coverage measures can be useful as indicators of the thoroughness of unit test suites [170], while mutation score is a more powerful and more effective measure of the fault finding effectiveness of test suites than statement and branch coverage , and data-flow [82 ,197]. Unfortunately, the empirical evidence on the impact of the TF practice on unit test suite characteristics is limited to code coverage. Therefore, preliminary results, presented in this section, extend the body of knowledge in software engineering by means of the analysis of the impact of the TF practice on mutation score indicator (an indicator of the fault detection effectiveness of developed unit tests).

  • Conference Article
  • Cite Count Icon 8
  • 10.1109/icodse.2015.7437005
Unit test code generator for lua programming language
  • Nov 1, 2015
  • Junno Tantra Pratama Wibowo + 2 more

Software testing is an important step in the software development lifecycle. One of the main process that take lots of time is developing the test code. We propose an automatic unit test code generation to speed up the process and helps avoiding repetition. We develop the unit test code generator using Lua programming language. Lua is a fast, lightweight, embeddable scripting language. It has been used in many industrial applications with focuses on embedded systems and games. Unlike other popular scripting language like JavaScript, Python, and Ruby, Lua does not have any unit test generator developed to help its software testing process. The final product, Lua unit test generator (LUTG), integrated to one of the most popular Lua IDE, ZeroBrane Studio, as a plugin to seamlessly connect the coding and testing process. The code generator can generate unit test code, save test cases data on Lua and XML file format, and generate the test data automatically using search-based technique, genetic algorithm, to achieve full branch coverage test criteria. Using this generator to test several Lua source code files shows that the developed unit test generator can help the unit testing process. It was expected that the unit test generator can improve productivity, quality, consistency, and abstraction of unit testing process.

  • Conference Article
  • Cite Count Icon 144
  • 10.1109/ase.2004.61
Rostra: a framework for detecting redundant object-oriented unit tests
  • Sep 20, 2004
  • Tao Xie + 2 more

Object-oriented unit tests consist of sequences of method invocations. Behavior of an invocation depends on the state of the receiver object and method arguments at the beginning of the invocation. Existing tools for automatic generation of object-oriented test suites, such as Jtest and J Crasher for Java, typically ignore this state and thus generate redundant tests that exercise the same method behavior, which increases the testing time without increasing the ability to detect faults. This work proposes Rostra, a framework for detecting redundant unit tests, and presents five fully automatic techniques within this framework. We use Rostra to assess and minimize test suites generated by test-generation tools. We also present how Rostra can be added to these tools to avoid generation of redundant tests. We have implemented the five Rostra techniques and evaluated them on 11 subjects taken from a variety of sources. The experimental results show that Jtest and JCrasher generate a high percentage of redundant tests and that Rostra can remove these redundant tests without decreasing the quality of test suites.

  • Conference Article
  • Cite Count Icon 3
  • 10.1145/3593434.3593443
NxtUnit: Automated Unit Test Generation for Go
  • Jun 14, 2023
  • Siwei Wang + 5 more

Automated test generation has been extensively studied for dynamically compiled or typed programming languages like Java and Python. However, Go, a popular statically compiled and typed programming language for server application development, has received limited support from existing tools. To address this gap, we present NxtUnit, an automatic unit test generation tool for Go that uses random testing and is well-suited for microservice architecture. NxtUnit employs a random approach to generate unit tests quickly, making it ideal for smoke testing and providing quick quality feedback. It comes with three types of interfaces: an integrated development environment (IDE) plugin, a command-line interface (CLI), and a browser-based platform. The plugin and CLI tool allow engineers to write unit tests more efficiently, while the platform provides unit test visualization and asynchronous unit test generation. We evaluated NxtUnit by generating unit tests for 13 open-source repositories and 500 ByteDance in-house repositories, resulting in a code coverage of 20.74% for in-house repositories. We conducted a survey among Bytedance engineers and found that NxtUnit can save them 48% of the time on writing unit tests. We have made the CLI tool available at https://github.com/bytedance/nxt_unit.

Save Icon
Up Arrow
Open/Close
Notes

Save Important notes in documents

Highlight text to save as a note, or write notes directly

You can also access these Documents in Paperpal, our AI writing tool

Powered by our AI Writing Assistant