AgoneTest: Automated creation and assessment of Unit tests leveraging Large Language Models
Software correctness is crucial, with unit testing playing an indispensable role in the software development lifecycle. However, creating unit tests is time-consuming and costly, underlining the need for automation. Leveraging Large Language Models (LLMs) for unit test generation is a promising solution, but existing studies focus on simple, small-scale scenarios, leaving a gap in understanding LLMs' performance in real-world applications, particularly regarding integration and assessment efficacy at scale. Here, we present AgoneTest, a system focused on automatically generating and evaluating complex class-level test suites. Our contributions include a scalable automated system, a newly developed dataset for rigorous evaluation, and a detailed methodology for test quality assessment.
- Discussion
2
- 10.1111/cogs.13430
- Mar 1, 2024
- Cognitive science
Large Language Models: A Historical and Sociocultural Perspective.
- Research Article
11
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can <i>IJDS</i> Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Conference Article
133
- 10.1145/3510003.3510203
- May 21, 2022
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
- Research Article
3
- 10.1145/3728951
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Unit testing plays a pivotal role in software development, improving software quality and reliability. However, generating effective test cases manually is time-consuming, prompting interest in unit testing research. Recently, Large Language Models (LLMs) have shown potential in various unit testing tasks, including test generation, assertion generation, and test evolution, but existing studies are limited in scope and lack a systematic evaluation of the effectiveness of LLMs. To bridge this gap, we present a large-scale empirical study on fine-tuning LLMs for unit testing. Our study involves three unit testing tasks, five benchmarks, eight evaluation metrics, and 37 popular LLMs across various architectures and sizes, consuming over 3,000 NVIDIA A100 GPU hours. We focus on three key research questions: (1) the performance of LLMs compared to state-of-the-art methods, (2) the impact of different factors on LLM performance, and (3) the effectiveness of fine-tuning versus prompt engineering. Our findings reveal that LLMs outperform existing state-of-the-art approaches on all three unit testing tasks across nearly all metrics, highlighting the potential of fine-tuning LLMs in unit testing tasks. Furthermore, large-scale, decoder-only models achieve the best results across tasks, while encoder-decoder models perform better under the same parameter scale. Additionally, the comparison of the performance between fine-tuning and prompt engineering approaches reveals the considerable potential capability of the prompt engineering approach in unit testing tasks. We then discuss the concerned issues on the test generation task, including data leakage issues, bug detection capabilities, and metrics comparisons. Finally, we further pinpoint carious practical guidelines for LLM-based approaches to unit testing tasks in the near future. Overall, our work demonstrates the promising future of fine-tuning LLMs on unit testing tasks and reduces the manual efforts of unit testing experts in practical scenarios.
- Research Article
12
- 10.1016/j.procs.2023.09.086
- Jan 1, 2023
- Procedia Computer Science
A Large and Diverse Arabic Corpus for Language Modeling
- Supplementary Content
- 10.1108/ir-02-2025-0074
- Jul 29, 2025
- Industrial Robot: the international journal of robotics research and application
Purpose This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications. Design/methodology/approach This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions. Findings The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards. Originality/value This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.
- Research Article
3
- 10.1038/s41698-025-00916-7
- May 23, 2025
- npj Precision Oncology
Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.
- Research Article
- 10.3348/kjr.2025.1045
- Jan 1, 2026
- Korean journal of radiology
To evaluate the accuracy and reasoning capabilities of large multimodal language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation. This experimental study used custom-made 401 radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was adjusted to 0 and 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses using four-point scales, separately for specific lesion locations and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses. Top-3 accuracy (i.e., correct answers present among top-3 differential diagnoses) of LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while radiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and imaging findings in 30.7% (123 of 401) and 12.9% (16 of 124) of cases without textual clinical history. Hallucinations occurred in 4.5% (18 of 401), and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and scanned body parts (92.2% [756 of 820]). LLMs remarkably underperformed compared with neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.
- Research Article
1
- 10.1080/13658816.2025.2577252
- Nov 1, 2025
- International Journal of Geographical Information Science
The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).
- Research Article
2
- 10.1145/3728970
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Unit testing plays a crucial role in bug detection and ensuring software correctness. It helps developers identify errors early in development, thereby reducing software defects. In recent years, large language models (LLMs) have demonstrated significant potential in automating unit test generation. However, using LLMs to generate unit tests faces many challenges. 1) The execution pass rate of the test cases generated by LLMs is low. 2) The test case coverage is inadequate, making it challenging to detect potential risks in the code. 3) Current research methods primarily focus on languages such as Java and Python, while studies on C programming are scarce, despite its importance in the real world. To address these challenges, we propose STRUT, a novel unit test generation method. STRUT utilizes structured test cases as a bridge between complex programming languages and LLMs. Instead of directly generating test code, STRUT guides LLMs to produce structured test cases, thereby alleviating the limitations of LLMs when generating code for programming languages with complex features. First, STRUT analyzes the context of focal methods and constructs structured seed test cases for them. These seed test cases then guide LLMs to generate a set of structured test cases. Subsequently, a rule-based approach is employed to convert the structured set of test cases into executable test code. We conducted a comprehensive evaluation of STRUT, which achieved an impressive execution pass rate of 96.01%, along with 77.67% line coverage and 63.60% branch coverage. This performance significantly surpasses that of the LLMs-based baseline methods and the symbolic execution tool SunwiseAUnit. These results highlight STRUT's superior capability in generating high-quality unit test cases by leveraging the strengths of LLMs while addressing their inherent limitations.
- Research Article
13
- 10.1145/3699598
- Feb 23, 2025
- ACM Transactions on Software Engineering and Methodology
Unit testing aims to validate the correctness of software system units and has become an essential practice in software development and maintenance. However, it is incredibly time-consuming and labor-intensive for testing experts to write unit test cases manually, including test inputs (i.e., prefixes) and test oracles (i.e., assertions). Very recently, some techniques have been proposed to apply Large Language Models (LLMs) to generate unit assertions and have proven the potential in reducing manual testing efforts. However, there has been no systematic comparison of the effectiveness of these LLMs, and their pros and cons remain unexplored. To bridge this gap, we perform the first extensive study on applying various LLMs to automated assertion generation. The experimental results on two independent datasets show that studied LLMs outperform six state-of-the-art techniques with a prediction accuracy of 51.82%–58.71% and 38.72%–48.19%. The improvements achieve 29.60% and 12.47% on average. Besides, as a representative LLM, CodeT5 consistently outperforms all studied LLMs and all baselines on both datasets, with an average improvement of 13.85% and 26.64%, respectively. We also explore the performance of generated assertions in detecting real-world bugs, and find LLMs are able to detect 32 bugs from Defects4J on average, with an improvement of 52.38% against the most recent approach EditAS . Inspired by the findings, we construct a simplistic retrieval-and-repair-enhanced LLM-based approach by transforming the assertion generation problem into a program repair task for retrieved similar assertions. Surprisingly, such a simplistic approach can further improve the prediction accuracy of LLMs by 9.40% on average, leading to new records on both datasets. Besides, we provide additional discussions from different aspects (e.g., the impact of assertion types and test lengths) to illustrate the capacity and limitations of LLM-based approaches. Finally, we further pinpoint various practical guidelines (e.g., the improvement of multiple candidate assertions) for advanced LLM-based assertion generation in the near future. Overall, our work underscores the promising future of adopting off-the-shelf LLMs to generate accurate and meaningful assertions in real-world test cases and reduce the manual efforts of unit testing experts in practical scenarios.
- Research Article
- 10.31474/1996-1588-2025-2-41-65-72
- Jan 1, 2025
- Scientific papers of Donetsk National Technical University. Series: Informatics, Cybernetics and Computer Science
"Currently, large language models can generate text in response to input data. They are even starting to show good performance in other tasks. In addition, large language models can be components of models that do more than just generate text. There are well-known projects in which large language models were used to create sentiment detectors, toxicity classifiers, and image captions. The above has led to the interest of various companies in creating large language models, which has contributed to the creation of a significant number of large language models. In this regard, it is very difficult for an ordinary user to navigate the existing variety of large language models. Analysis of recent studies and publications on large language models has shown that, as a rule, they concern one large language model, or a comparative analysis of two large language models, and less often a comparative analysis of several large language models. Among the recent publications devoted to the study of large language models, one can note a publication that groups large language models according to their ease of use by end users. However, the above-mentioned work did not study large language models with which the user cannot interact via a chatbot and which are not available to ordinary users. It should be noted that users of large language models are not only physical users but also companies for which large language models with which the user cannot interact via a chatbot and which are not available to ordinary users, but may be available to the company, may also be interesting and in demand. As a result of the research, the classification of large language models was improved, which will allow different users to better navigate large language models and facilitate the search for the necessary language model. It should be noted that existing large language models are constantly being developed and improved by their developers. In addition, many large well-known companies and their separate divisions are working on the development of new large language models. In this regard, there is a constant need to track these processes and improve the classification of large language models in accordance with their current state."
- Research Article
5
- 10.1109/embc53108.2024.10782119
- Jul 15, 2024
- Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high throughput methods. Over the past 30 years, progress toward making high-throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high throughput deep phenotyping physician notes.Clinical relevance: Large language models will likely emerge as the dominant method for the high throughput phenotyping of signs and symptoms in physician notes.
- Research Article
101
- 10.1038/s41746-024-01024-9
- Feb 19, 2024
- NPJ Digital Medicine
Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with ~ 124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~ 175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and also advancing the use of LLMs for biological and medical inference tasks.
- Research Article
- 10.65521/ijacect.v14i3s.1636
- Dec 22, 2025
- International Journal on Advanced Computer Engineering and Communication Technology
PRISMA principles provide a thorough analysis of current advances in large language models (LLMs) and multimodal transformers for medical applications. As LLMs like GPT-4, BioGPT, Med-PaLM, and hybrid frameworks like COMCARE enter clinical processes, thorough synthesis is essential to increase performance, methodological adaptability, and implementation practicality in many healthcare situations. Their creativity in medical report writing, decision support, and diagnosis is notable, but the literature has not established a cohesive taxonomy that evaluates these models by uniform metrics, domain-specific generalizability, and ethical acceptability. Over 40 studies examined radiology report production, clinical question responding, cognitive assessment, and causal reasoning. After testing vision-language transformer architectures like PEGASUS and ETB MII for automated imaging-based reporting, graph-based reasoning was used to evaluate drug safety and interpretability of knowledge- integrated models like KELLM. As needed, BLEU, ROUGE, F1 score, CIDEr, and qualitative evaluations were used. Domain- adapted and hybrid models improve diagnostic accuracy, task- specific explainability, and clinician workload differently. Model illusion, biases, hostile manipulation, and resource-intensive fine- tuning persist. The report recommends strong benchmarking, public evaluation standards, and ethical frameworks for LLMs in high-stakes medical applications. This study defines LLMs' therapeutic utility and recommends infrastructure, ethics, and technology for safe and successful integration. This effort prepares scalable, interpretable, and equitable medical AI systems.