Articles published on Unit Test Generation
73 Search results
- Research Article
2
- 10.1145/3745765
- Mar 11, 2026
- ACM Transactions on Software Engineering and Methodology
- Junwei Zhang + 4 more
Recently, Large Language Models (LLMs) have shown promising results in code generation, and several automated test generation approaches based on LLMs have been proposed. Although these approaches achieve promising performance, they suffer from two limitations. First, they lack the intrinsic understanding of the semantic intricacies and logical constructs inherent to the focal method. Second, they ignore the diversity of the generated tests and generate tests with limited code coverage. To alleviate these two limitations, in this work, we propose a novel approach named TestCTRL that optimizes LLMs for unit test generation by the Chain-of-Thought (CoT) prompt and Reinforcement Learning (RL) strategy. Specifically, we first build a new CoT dataset, containing the focal methods, corresponding unit tests, and CoT prompts. The CoT prompt includes the intention and possible test input values. Then, the CoT dataset is used to fine-tune one LLM (i.e., CodeLlama 7B) that can be seen as the policy model in RL. Meanwhile, we fine-tune another LLM (i.e., CodeGPT) as the reward model by predicting the line coverage of the focal method and its test. Moreover, we employ the Proximal Policy Optimization (PPO) algorithm to optimize the policy model and generate unit tests. We use the Defects4J benchmark to evaluate our approach from three perspectives (i.e., naturalness, validity, and code coverage). To avoid data leakage threats, we filtered out data from the CoT dataset that have the same focal method and test case names as those in Defects4J. The experimental results demonstrate that TestCTRL outperforms state-of-the-art baselines in both line and branch coverage. Besides, TestCTRL improves bug detection performance. We also investigate the reason for the proposed approach’s superiority.
- Research Article
- 10.1016/j.infsof.2025.107948
- Feb 1, 2026
- Information and Software Technology
- Shaojian Qiu + 4 more
Boosting unit test generation via structure-aware fine-tuning of pre-trained model
- Research Article
- 10.35746/jtim.v8i1.912
- Jan 22, 2026
- Jurnal Teknologi Informasi dan Multimedia
- Salman El Farisi + 1 more
Automated unit testing is essential for ensuring the security and reliability of smart contracts, particularly because their immutable nature prevents post-deployment modifications. However, manually creating test scenarios remains time-consuming, costly, and highly dependent on expert knowledge. A potential solution is to utilize AI technology, particularly Large Language Models (LLMs), to automatically generate test scenarios. This study fills the research gap in leveraging LLM technology in the software testing space by proposing a workflow for automatically generating unit test scenarios for blockchain smart contract code using LLMs. The proposed workflow consists of two stages: converting Solidity smart contracts into structured Gherkin scenarios and translating those scenarios into executable Hardhat unit test scripts. Using the Gemini 2.5 Pro model, the research evaluates three prompting techniques (Chain-of-Thought, Few-Shot, and Role-Based) through quantitative analysis based on code coverage metrics, including Statements, Branches, Functions, and Lines. The experimental results show that Role-Based Prompting achieves the highest average coverage (92.02%), followed by Few-Shot Prompting (89.52%), while Chain-of-Thought produces the lowest coverage (78.79%). Role-Based Prompting also attains the highest Branch coverage, demonstrating superior capability in capturing conditional logic within smart contracts.
- Research Article
- 10.3390/e28010074
- Jan 9, 2026
- Entropy
- Xiaojian Liu + 1 more
Code coverage-guided unit test generation (CGTG) and large language model-based test generation (LLMTG) are two principal approaches for the generation of unit tests. Each of these approaches has its inherent advantages and drawbacks. Tests generated by CGTG have been shown to exhibit high code coverage and high executability. However, they lack the capacity to comprehend code intent, which results in an inability to identify deviations between code implementation and design intent (i.e., functional defects). Conversely, although LLMTG demonstrates an advantage in terms of code intent analysis, it is generally characterized by low executability and necessitates iterative debugging. In order to enhance the ability of unit test generation to identify functional defects, a novel framework has been proposed, entitled the intent analysis-guided unit test generation and refinement (IGTG&R) model. The IGTG&R model consists of a two-stage process for test generation. In the first stage, we introduce coverage path entropy to enhance CGTG to achieve high executability and code coverage of test cases. The second stage refines the test cases using LLMs to identify functional defects. We quantify and verify the interference of incorrect code implementation on intent analysis through conditional entropy. In order to reduce this interference, the focal method body is excluded from the code context information during intent analysis. Using this two-stage process, IGTG&R achieves a more profound comprehension of the intent of the code and the identification of functional defects. The IGTG&R model has been demonstrated to achieve an identification rate of functional defects ranging from 65% to 89%, with an execution success rate of 100% and a code coverage rate of 75.8%. This indicates that IGTG&R is superior to the CGTG and LLMTG approaches in multiple aspects.
- Research Article
- 10.31891/2307-5732-2025-359-45
- Dec 11, 2025
- Herald of Khmelnytskyi National University. Technical sciences
- Володимир Маковишин
Modern automated test generation tools achieve high code coverage but largely ignore the semantic aspects of software. Large Language Models (LLMs) open new horizons in testing, particularly in creating meaningful, logically justified tests, conducting deep API documentation analysis, and detecting complex logical defects. This paper introduces the LLMTester method, which combines the intelligent capabilities of LLMs with classical testing approaches. The method involves automatic generation of unit tests and functional scenarios, evaluation of their semantic coverage as a complement to traditional metrics, and automated failure analysis. Experimental evaluation results based on the open-source web application Prestashop demonstrate a significant improvement in testing quality, reduction in test creation time, and increased defect detection efficiency compared to traditional approaches. Our work highlights the potential of LLMs not only for automation but also for intelligent enhancement of the software quality assurance process, particularly through the introduction of a new semantic coverage metric.
- Research Article
1
- 10.1145/3765758
- Dec 3, 2025
- ACM Transactions on Software Engineering and Methodology
- Zhe Zhang + 5 more
Automated unit test generation has been widely studied, with Large Language Models (LLMs) recently showing significant potential. LLMs like GPT-4, trained on vast text and code data, excel in various code-related tasks, including unit test generation. However, existing LLM-based approaches often focus solely on the context within the code itself, such as referenced variables, while neglecting broader task-specific contexts, such as the utility of referring to existing tests of relevant methods in unit test generation. Moreover, in the context of unit test generation, these tools prioritize high code coverage, often at the expense of practical usability, correctness, and maintainability. In response, we propose Reference-Based Retrieval Augmentation, a novel mechanism that extends LLM-based Retrieval-Augmented Generation (RAG) to retrieve relevant information by considering task-specific context. In the unit test generation task, for a given focal method, the reference relationship is defined as the reusability or referentiality of tests between the focal method and other methods. To generate high-quality unit tests for the focal method, the test reference relationships are then used to retrieve relevant methods and their existing unit tests. Specifically, we account for the unique structure of unit tests by dividing the test generation process into Given, When, and Then phases. When generating unit tests for a focal method, we retrieve pre-existing tests of other relevant methods, which can provide valuable insights for any of the Given, When, and Then phases. We implement this approach in a tool called RefTest, which sequentially performs preprocessing, test reference retrieval, and unit test generation, using an incremental strategy in which newly generated tests guide the creation of subsequent ones.
We evaluated RefTest on 12 open-source projects with 1515 methods, and the results demonstrate that RefTest consistently outperforms existing tools in terms of correctness, completeness, and maintainability of the generated tests.
- Research Article
- 10.69849/revistaft/pa10202511151938
- Nov 15, 2025
- Revista ft
- Gabriel Mendes Dias + 2 more
The field of software testing faces complexity and scale challenges due to technological evolution, making unit testing crucial for quality assurance and preventing high-impact failures. This study aimed to analyze the importance of unit testing for system robustness and to validate the potential of Artificial Intelligence (AI) in its automation. A bibliographic review on software testing was conducted, followed by a practical comparative study where the Sieve of Eratosthenes function was provided to three Generative AI tools (Gemini, ChatGPT, and GitHub Copilot) to generate unit test suites in Python. The results confirmed that unit testing implementation, accelerated by AI, leads to early fault detection. Although all AIs generated functional tests with minimal execution time, quality and robustness varied significantly. ChatGPT stood out with the highest logical coverage by explicitly including tests for negative numbers, an essential edge case in mathematical algorithms. In contrast, GitHub Copilot delivered the highest code quality by using the pytest framework and parametrization to cover five types of invalid inputs within a single structure, demonstrating superior intelligence in input robustness. It is concluded that AI is a powerful testing accelerator, optimizing creation, but human review and validation remain essential to ensure critical scenario coverage and adherence to coding best practices.
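The kind of test suite this abstract describes can be sketched as follows. This is a minimal illustration only, not the code produced by any of the three tools: the function name `sieve`, the choice to return an empty list for negative limits, and the five entries in `INVALID_INPUTS` are all assumptions made for the example. The table of invalid inputs is checked with a plain loop; in pytest the same table would typically be passed to `@pytest.mark.parametrize`.

```python
def sieve(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    # Reject non-integers; bool is excluded explicitly because it subclasses int.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise TypeError("limit must be an int")
    # Assumption for this sketch: negative and small limits yield no primes.
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p squared.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]


# Hypothetical table of five invalid input types (not taken from the study).
INVALID_INPUTS = [None, "10", 3.5, [10], True]


def rejects(value):
    """Return True if sieve() raises TypeError for the given value."""
    try:
        sieve(value)
    except TypeError:
        return True
    return False


def run_checks():
    """Table-driven checks covering a normal case, edge cases, and invalid inputs."""
    assert sieve(10) == [2, 3, 5, 7]
    assert sieve(-5) == []  # negative-number edge case
    for value in INVALID_INPUTS:
        assert rejects(value), f"expected TypeError for {value!r}"


run_checks()
```

Keeping the invalid inputs in a single table means a new case is one added list entry rather than a new test function, which is the maintainability benefit the abstract attributes to parametrization.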
- Research Article
1
- 10.1007/s10515-025-00539-z
- Sep 18, 2025
- Automated Software Engineering
- Omur Sahin + 2 more
Search-Based Software Testing (SBST) has seen several success stories in academia and industry. The effectiveness of a search algorithm at solving a software engineering problem strongly depends on how such an algorithm can navigate the fitness landscape of the addressed problem. The fitness landscape depends on the fitness function used. Understanding the properties of a fitness landscape can help to provide insight into how a search algorithm behaves on it. Such insight can help researchers design novel, more effective search algorithms and fitness functions tailored for a specific problem. Due to its importance, a few fitness landscape analyses have been carried out in the scientific literature of SBST. However, those have been focusing on the problem of unit test generation, e.g., with state-of-the-art tools such as EvoSuite. In this paper, we replicate one such existing study. However, in our work we focus on system test generation, with the state-of-the-art tool EvoMaster. Based on an empirical study involving the testing of 23 web services, this enables us to provide valuable insight into this important testing domain of practical industrial relevance. Our results indicate that fitness landscapes are largely dominated by neutral regions (e.g., plateaus), which make the search process challenging. We observe that the presence of information content in the landscape can improve search guidance, while boolean flags are a primary contributor to neutrality. These findings confirm prior results in unit testing but also reveal system-level differences, particularly in how branch types impact search effectiveness. These insights suggest the need for improved fitness functions, testability transformations, and search operators tailored to system-level testing.
- Research Article
4
- 10.1016/j.cola.2025.101348
- Sep 1, 2025
- Journal of Computer Languages
- Ruofan Yang + 2 more
TestLoter: A logic-driven framework for automated unit test generation and error repair using large language models
- Research Article
4
- 10.1145/3763791
- Aug 26, 2025
- ACM Transactions on Software Engineering and Methodology
- Yuwei Zhang + 8 more
Unit testing plays a pivotal role in the software development lifecycle, as it ensures code quality. However, writing high-quality unit tests remains a time-consuming task for developers in practice. More recently, the application of large language models (LLMs) in automated unit test generation has demonstrated promising results. Existing approaches primarily focus on interpreted programming languages (e.g., Java), while mature solutions tailored to compiled programming languages like C++ are yet to be explored. The intricate language features of C++, such as pointers, templates, and virtual functions, pose particular challenges for LLMs in generating both executable and high-coverage unit tests. To tackle the aforementioned problems, this paper introduces CITYWALK, a novel LLM-based framework for C++ unit test generation. CITYWALK enhances LLMs by providing a comprehensive understanding of the dependency relationships within the project under test via program analysis. Furthermore, CITYWALK incorporates language-specific knowledge about C++ derived from project documentation and empirical observations, significantly improving the correctness of the LLM-generated unit tests. We implement CITYWALK by employing the widely popular LLM GPT-4o. The experimental results show that CITYWALK outperforms current state-of-the-art approaches on a collection of ten popular C++ projects. Our findings demonstrate the effectiveness of CITYWALK in generating high-quality C++ unit tests.
- Research Article
- 10.25140/2411-5363-2025-2(40)-312-324
- Aug 11, 2025
- Technical sciences and technologies
- Oleksii Kondus + 1 more
The fast-paced information technology market demands rapid and high-quality software development to stay competitive. However, routine tasks such as code documentation, refactoring, test creation, and ensuring S.O.L.I.D. principle compliance consume a significant amount of developer time, with studies showing that up to 50% of effort is spent on such activities. Existing tools, such as GitHub Copilot and Tabnine, offer partial automation but lack comprehensive S.O.L.I.D. analysis, flexible AI model selection, and seamless integration within environments like Visual Studio Code. This highlights the need for a robust solution to streamline workflows, aligning with the growing use of AI assistants to boost coding efficiency and quality. This study tackles these challenges by introducing the Smart AI Code Assistant, a VS Code extension that automates routine tasks using AI models such as GPT, Claude, Gemini, Grok, and DeepSeek. The research aims to enhance developer productivity through automated documentation, refactoring, unit test generation, and S.O.L.I.D. compliance checks within a unified interface. Unlike other tools, the module allows task-specific AI model selection for optimal speed, accuracy, and cost, and provides detailed S.O.L.I.D. analysis with actionable feedback, improving code architecture. The methodology involved analyzing automation trends, evaluating AI model capabilities, and developing S.O.L.I.D. verification methods. Built with JavaScript and Node.js, the module uses Tree-sitter for code analysis and supports languages like JavaScript, Java, and Python. Key features include safe documentation generation, modular refactoring, test integration, and S.O.L.I.D. violation reports with fixes. Experiments tested refactoring performance on a flawed JavaScript TaskManager class across 100 trials, assessing test pass rates and response times. The results of the experiments demonstrated varying effectiveness of AI models.
Claude-3-7-sonnet and deepseek-chat achieved 100% test pass rates, with Claude faster (10.81 vs. 33.14 seconds). Gemini-2.0-flash balanced speed (4.24 seconds) and accuracy (97.75%), offering cost-effectiveness. The module’s cohesive VS Code integration reduces manual effort and enhances code quality. It is a practical tool with potential for expansion to languages like C# and Go, and CI/CD integration. The Smart AI Code Assistant advances software development by addressing existing tool limitations, enabling faster, higher-quality outputs.
- Research Article
- 10.33407/itlt.v107i3.6184
- Jun 29, 2025
- Information Technologies and Learning Tools
- Kateryna Osadcha + 3 more
This study aims to examine IT educators’ opinions on using Microsoft Copilot Chat for their professional tasks. The significance of this research lies in the increasing influence of generative AI technologies on learning and the necessity to evaluate their feasibility. The study employs an expert survey method based on a rating scale. 18 experts participated in it. The results indicate varying levels of satisfaction among experts with Microsoft Copilot Chat responses depending on the type of task. The highest-rated tasks were Trivia on a certain topic (4.67), unit test generation (4.50), optimise code (4.44), creating the content for slides on a certain topic (4.44), and creating a comparative table between different items (4.27). The tasks with the lowest ratings were creation of a logo for the conference (3.22), grading essays based on rubrics (3.17), identifying a logical fallacy in a particular article (3.00), convert the text in the image to a format that I can copy and paste (2.88), and creating a mind map to illustrate concepts (2.70). Therefore, using Microsoft Copilot Chat for these tasks with low ratings is not currently recommended. We used the SPSS Statistics suite to calculate Cronbach’s Alpha and Cronbach’s Alpha Based on Standardised Items. Based on the analysis of the experts’ responses, ratings were collected for each professional task for which a prompt was provided. The study’s practical significance lies in demonstrating to educators the capabilities of Microsoft Copilot Chat in performing their routine professional tasks.
It has been particularly effective in several areas, including: administrative tasks (writing speeches, planning routes), assessment (developing tests, tasks for formative and summative assessment), communication (preparing information materials), lesson planning (generating ideas, creating graphic materials), programming assistance (explaining and optimising code), scientific activities (creating bibliographies, analysing articles), and others (e.g. playing intellectual games on the relevant topic). Future research opportunities are proposed, including the development of advanced training programs for IT educators on integrating AI into their professional practices and an examination of the effectiveness of these programs.
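For readers unfamiliar with the reliability statistic mentioned in this abstract, Cronbach's alpha for k items is k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the respondents' total scores). The sketch below computes it in plain Python rather than SPSS; the function name `cronbach_alpha` and the toy ratings are made up for illustration and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table with one row per respondent, one column per item."""
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*ratings)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Toy data (hypothetical): three respondents rating two items identically,
# so the items are perfectly consistent and alpha equals 1.
toy_ratings = [[1, 1], [2, 2], [3, 3]]
alpha = cronbach_alpha(toy_ratings)
```

Values closer to 1 indicate that the items vary together across respondents, which is why the statistic is commonly reported alongside rating-scale surveys of this kind.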
- Research Article
4
- 10.1145/3728970
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
- Jinwei Liu + 5 more
Unit testing plays a crucial role in bug detection and ensuring software correctness. It helps developers identify errors early in development, thereby reducing software defects. In recent years, large language models (LLMs) have demonstrated significant potential in automating unit test generation. However, using LLMs to generate unit tests faces many challenges. 1) The execution pass rate of the test cases generated by LLMs is low. 2) The test case coverage is inadequate, making it challenging to detect potential risks in the code. 3) Current research methods primarily focus on languages such as Java and Python, while studies on C programming are scarce, despite its importance in the real world. To address these challenges, we propose STRUT, a novel unit test generation method. STRUT utilizes structured test cases as a bridge between complex programming languages and LLMs. Instead of directly generating test code, STRUT guides LLMs to produce structured test cases, thereby alleviating the limitations of LLMs when generating code for programming languages with complex features. First, STRUT analyzes the context of focal methods and constructs structured seed test cases for them. These seed test cases then guide LLMs to generate a set of structured test cases. Subsequently, a rule-based approach is employed to convert the structured set of test cases into executable test code. We conducted a comprehensive evaluation of STRUT, which achieved an impressive execution pass rate of 96.01%, along with 77.67% line coverage and 63.60% branch coverage. This performance significantly surpasses that of the LLMs-based baseline methods and the symbolic execution tool SunwiseAUnit. These results highlight STRUT's superior capability in generating high-quality unit test cases by leveraging the strengths of LLMs while addressing their inherent limitations.
- Research Article
3
- 10.1145/3715778
- Jun 19, 2025
- Proceedings of the ACM on Software Engineering
- Junwei Zhang + 5 more
Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of datasets contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset, on average, improves their performance by 67% in branch coverage, using the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%.
Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests.
- Research Article
- 10.1145/3729362
- Jun 19, 2025
- Proceedings of the ACM on Software Engineering
- Sujin Jang + 3 more
We present UnitCon, a system for synthesizing targeted unit tests for runtime exceptions in Java programs. Targeted unit tests aim to reveal a bug at a specific location in the program under test. This capability benefits various tasks in software development, such as patch testing, crash reproduction, or static analysis alarm inspection. However, conventional unit test generation tools are mainly designed for regression tests by maximizing code coverage; hence they are not effective at such target-specific tasks. In this paper, we propose a novel synthesis technique that effectively guides the search for targeted unit tests. The key idea is to use static analysis to prune and prioritize the search space by estimating the semantics of candidate test cases. This allows us to efficiently focus on promising unit tests that are likely to trigger runtime exceptions at the target location. According to our experiments on a suite of Java programs, our approach outperforms the state-of-the-art unit test generation tools. We also applied UnitCon for inspecting static analysis alarms for null pointer exceptions (NPEs) in 51 open-source projects and discovered 21 previously unknown NPE bugs.
- Research Article
3
- 10.15587/2706-5448.2025.330595
- May 26, 2025
- Technology audit and production reserves
- Ihor Pysmennyi + 2 more
Traditional quality assurance (QA) methods face significant challenges in addressing the complexity, scale, and rapid iteration cycles of modern software systems and are strained by limited available resources, leading to substantial costs associated with poor quality. The object of this research is the quality assurance processes for modern distributed software applications. The subject of the research is the assessment of the benefits, challenges, and prospects of integrating modern AI-oriented tools into quality assurance processes. A comprehensive analysis of implications was performed on both verification and validation processes, covering exploratory test analyses, equivalence partitioning and boundary analyses, metamorphic testing, finding inconsistencies in acceptance criteria (AC), static analyses, test case generation, unit test generation, test suite optimization and assessment, and end-to-end scenario execution. End-to-end regression of a sample enterprise application utilizing AI agents over generated test scenarios was implemented as a proof of concept highlighting practical use of the study. The results, with only 8.3% flaky executions of generated test cases, indicate significant potential for the proposed approaches. However, the study also identified substantial challenges for practical adoption concerning generation of semantically identical coverage, the “black box” nature and lack of explainability of state-of-the-art Large Language Models (LLMs), and the tendency to correct mutated test cases to match expected results, underscoring the necessity for thorough verification of both generated artifacts and test execution results. The research demonstrates AI's transformative potential for QA but highlights the importance of a strategic approach to implementing these technologies, considering the identified limitations and the need for developing appropriate verification methodologies.
- Research Article
- 10.15388/mitt.2025.32
- May 12, 2025
- Vilnius University Open Series
- Dovydas Marius Zapkus + 1 more
Unit testing is critical in software quality assurance, and large language models (LLMs) offer an approach to automate this process. This paper evaluates the quality of unit tests generated by large language models using structured output prompts. The research applied six LLMs to generate unit tests for C# focal methods across different classes of cyclomatic complexity. The experimental results show that requiring LLMs to produce output in a strict structure (the Arrange-Act-Assert pattern) significantly influences the quality of the generated unit tests.
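The Arrange-Act-Assert structure named in this abstract can be illustrated with a short test. The study targets C# focal methods; the sketch below uses Python's `unittest` purely for illustration, and the `clamp` function and test names are hypothetical stand-ins, not material from the paper.

```python
import unittest


def clamp(value, low, high):
    """Hypothetical focal method: restrict value to the range [low, high]."""
    return max(low, min(value, high))


class ClampTests(unittest.TestCase):
    def test_value_above_range_is_clamped_to_high(self):
        # Arrange: set up the inputs for the scenario under test.
        value, low, high = 15, 0, 10

        # Act: call the focal method exactly once.
        result = clamp(value, low, high)

        # Assert: verify the observable outcome.
        self.assertEqual(result, 10)

    def test_value_inside_range_is_unchanged(self):
        # Arrange
        value, low, high = 4, 0, 10

        # Act
        result = clamp(value, low, high)

        # Assert
        self.assertEqual(result, 4)
```

Separating the three phases keeps each test focused on one behavior and makes it easy to see which inputs, call, and expectation belong together.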
- Research Article
1
- 10.1609/aaai.v39i28.35246
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Akhil Deo
Unit testing is essential for ensuring software quality, but it is often time-consuming and prone to developer oversight. With the rise of large language models (LLMs) in code generation, there is an increasing need for reliable and automated test generation systems. This work presents QAagent, a multi-agent system designed to generate unit tests using natural language pseudocode. QAagent leverages LLMs to create a detailed natural language plan of a function's implementation and then generates a comprehensive suite of test cases covering both base and edge scenarios. Experiments conducted on two widely-used benchmarks, HumanEval and MBPP, show that QAagent consistently outperforms existing frameworks in terms of code coverage, although its accuracy varies across datasets, demonstrating the potential for utilizing natural language pseudocode to enhance automated test generation in LLM-driven coding environments.
- Research Article
6
- 10.3390/electronics14071463
- Apr 4, 2025
- Electronics
- Shaheer Rehan + 2 more
Software testing is critical for ensuring software reliability, with test case generation often being resource-intensive and time-consuming. This study leverages the Llama-2 large language model (LLM) to automate unit test generation for Java focal methods, demonstrating the potential of AI-driven approaches to optimize software testing workflows. Our work leverages focal methods to prioritize critical components of the code to produce more context-sensitive and scalable test cases. The dataset, comprising 25,000 curated records, underwent tokenization and QLoRA quantization to facilitate training. The model was fine-tuned, achieving a training loss of 0.046. These results show the promise of AI-driven test case generation and underscore the feasibility of using fine-tuned LLMs for test case generation, highlighting opportunities for improvement through larger datasets, advanced hyperparameter optimization, and enhanced computational resources. We conducted a human-in-the-loop validation on a subset of unit tests generated by our fine-tuned LLM. This confirms that these tests effectively leverage focal methods, demonstrating the model’s capability to generate more contextually accurate unit tests. The work suggests the need to develop novel objective validation metrics specifically tailored to test cases generated by large language models. This work establishes a foundation for scalable and efficient software testing solutions driven by artificial intelligence. The data and code are publicly available on GitHub.
- Research Article
1
- 10.1109/access.2025.3597049
- Jan 1, 2025
- IEEE Access
- Sintayehu Zekarias Esubalew + 1 more
The rapid evolution of software development necessitates efficient unit testing to ensure reliability, yet manual test case generation is labor-intensive and often inadequate for agile workflows. Despite advancements, a comprehensive review of AI-driven unit test case generation, particularly for Java, is lacking, motivating this study to address this gap. The paper examines AI algorithms for unit test case generation, focusing on Java-specific challenges like class hierarchies and dependency injection. We propose a novel taxonomy categorizing methods into traditional machine learning (e.g., genetic algorithms, SVMs), deep learning (e.g., RNNs, GNNs), and transformer-based approaches (e.g., PLBART, enhanced by LoRA and QLoRA). Key contributions include: 1) a structured taxonomy for comparing AI methods based on effectiveness, usability, and maintainability; 2) a Java-specific focus addressing enterprise system complexities; and 3) identification of research gaps, such as scalability and assertion accuracy. Findings reveal that transformer-based models like A3Test and ChatUniTest achieve up to 59% test case correctness and 77% focal method coverage, outperforming traditional methods, though challenges in computational cost and assertion accuracy persist. Tools are evaluated for integration with CI/CD pipelines and advanced capabilities like parameter-efficient fine-tuning (PEFT) using LoRA and QLoRA. This review provides a roadmap for researchers and practitioners to advance automated, high-quality unit testing for Java software quality.