Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests
Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software development as they can lead to unnecessary attempts to debug production or testing code. Instead of rerunning test cases multiple times, which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus reducing the wasted cost of re-running and debugging these test cases. However, the state-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it can be challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this article, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases, and thus does not require (a) access to production code (black-box), (b) rerunning test cases, or (c) pre-defining features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the FlakeFlagger approach, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two different evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Flakify achieved F1-scores of 79% and 73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively.
Similarly, Flakify achieved F1-scores of 98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and 18 percentage points (pp) in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing the cost bound to be wasted on unnecessarily debugging test cases and production code by the same percentages (corresponding to reduction rates of 25% and 64%). Flakify also achieved significantly higher prediction results when used to predict test cases on new projects, suggesting better generalizability over FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.
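The per-project validation mentioned in the abstract above can be read as holding out one project at a time and training on the rest. The following is a minimal sketch under that assumption, not the authors' evaluation code: the function name `per_project_splits`, the tuple layout, and the toy samples are all hypothetical.

```python
from collections import defaultdict


def per_project_splits(samples):
    """Yield (held_out_project, train, test), holding out one project at a time.

    `samples` is a list of (project, test_source, is_flaky) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_project = defaultdict(list)
    for sample in samples:
        by_project[sample[0]].append(sample)

    for held_out in sorted(by_project):
        # All tests of the held-out project form the test set ...
        test = by_project[held_out]
        # ... and the tests of every other project form the training set.
        train = [s for project, group in by_project.items()
                 if project != held_out
                 for s in group]
        yield held_out, train, test


# Hypothetical toy data: project name, test source code, flakiness label.
samples = [
    ("proj-a", "void testA1() { ... }", True),
    ("proj-a", "void testA2() { ... }", False),
    ("proj-b", "void testB1() { ... }", False),
    ("proj-c", "void testC1() { ... }", True),
]
splits = list(per_project_splits(samples))
```

Because no test case from the held-out project appears in the training set, a split of this kind measures how a predictor behaves on a project it has never seen, which ordinary cross-validation over pooled test cases does not guarantee.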
- Research Article
2
- 10.1002/stvr.1870
- Jan 11, 2024
- Software Testing, Verification and Reliability
In software evolution, keeping the test code co‐changing with the production code is important, because outdated test code may not work and is ineffective in revealing faults in the production code. However, due to tight development schedules, developers may not co‐change the production and test code immediately. For example, we analysed the top 1003 popular Java projects on GitHub and found that in nearly 9.3% of cases (i.e., 464,417) the production and test code were not updated at the same time; that is, the production code was updated first, and the test code was updated only after an interval. The result indicates that much test code is not updated in time. In this paper, we propose a novel approach, Jtup, to remind developers to co‐change the production code and test code in time. Specifically, we first define co‐changed production and test code as a positive instance, and unchanged test code (i.e., production code changed and test code unchanged) as a negative instance. Then, we extract multidimensional features from the production code to characterize the possibility of co‐change, including code change features, code complexity features, and code semantic features. Finally, several machine learning‐based methods are employed to identify the co‐changed production and test code. We conduct comprehensive experiments on 20 datasets, and the results show that the Accuracy, Precision, and Recall achieved by Jtup are 76.7%, 78.1%, and 77.4%, respectively, outperforming the state‐of‐the‐art method.
- Research Article
17
- 10.1145/3607183
- Sep 30, 2023
- ACM Transactions on Software Engineering and Methodology
Many software processes advocate that the test code should co-evolve with the production code. Prior work usually studies such co-evolution based on production-test co-evolution samples mined from software repositories. A production-test co-evolution sample refers to a pair of a test code change and a production code change where the test code change triggers or is triggered by the production code change. The quality of the mined samples is critical to the reliability of research conclusions. Existing studies mined production-test co-evolution samples based on the following assumption: if a test class and its associated production class change together in one commit, or a test class changes immediately after the changes of the associated production class within a short time interval, this change pair should be a production-test co-evolution sample. However, the validity of this assumption has never been investigated. To fill this gap, we present an empirical study, investigating the reasons for test code updates occurring after the associated production code changes, and revealing the pervasive existence of noise in the production-test co-evolution samples identified based on the aforementioned assumption by existing works. We define a taxonomy of such noise, including six categories (i.e., adaptive maintenance, perfective maintenance, corrective maintenance, indirectly related production code update, indirectly related test code update, and other reasons). Guided by the empirical findings, we propose CHOSEN (an identifiCation metHod Of production-teSt co-EvolutioN) based on a two-stage strategy. CHOSEN takes a test code change and its associated production code change as input, aiming to determine whether the production-test change pair is a production-test co-evolution sample. Such identified samples are the basis of or are useful for various downstream tasks. We conduct a series of experiments to evaluate our method.
Results show that (1) CHOSEN achieves an AUC of 0.931 and an F1-score of 0.928, significantly outperforming existing identification methods, and (2) CHOSEN can help researchers and practitioners draw more accurate conclusions on studies related to the co-evolution of production and test code. For the task of Just-In-Time (JIT) obsolete test code detection, which can help detect whether a piece of test code should be updated when developers modify the production code, the test set constructed by CHOSEN can help measure the detection method’s performance more accurately, only leading to 0.76% of average error compared with ground truth. In addition, the dataset constructed by CHOSEN can be used to train a better obsolete test code detection model, of which the average improvements on accuracy, precision, recall, and F1-score are 12.00%, 17.35%, 8.75%, and 13.50% respectively.
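The mining assumption that this abstract questions (same commit, or a test change shortly after the production change) can be written down as a simple rule. The sketch below illustrates that baseline heuristic only, not CHOSEN's two-stage method. The function name `is_candidate_pair` and the 48-hour window are hypothetical, since the abstract says only "a short time interval".

```python
from datetime import datetime, timedelta


def is_candidate_pair(prod_change_time, test_change_time, same_commit,
                      window=timedelta(hours=48)):
    """Apply the time-window assumption to one production/test change pair.

    The 48-hour default is an illustrative assumption, not a value
    taken from the abstract.
    """
    if same_commit:
        return True
    # The test change must follow the production change within the window.
    delay = test_change_time - prod_change_time
    return timedelta(0) <= delay <= window


# Hypothetical usage: a test change three hours after the production change.
prod_time = datetime(2023, 1, 1, 12, 0)
accepted = is_candidate_pair(prod_time, prod_time + timedelta(hours=3),
                             same_commit=False)
```

A rule of this shape accepts any pair that happens to fall inside the window, whether or not the two changes are related, which is the kind of noise the study categorizes.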
- Conference Article
7
- 10.1109/seaa.2016.51
- Aug 1, 2016
A fundamental goal of software engineering practice is to ensure that code quality is maintained throughout its lifetime. Measuring and maintaining the quality of test code should be as important as measuring production (in-the-field) code. However, test code often seems to be a second class citizen compared to production code in terms of its upkeep and general maintenance. Many of the code features we might expect in test code are either absent or included when they should not be. In this paper, we investigate four releases of an industrial embedded multi-core system from four perspectives and compare results for test code with corresponding production code. We considered these four perspectives as indicators of code quality. Firstly, we looked at whether test and production code conformed to a set of in-house designated design rules. Secondly, we explored whether test code contained a reasonable ratio of comment lines to code lines relative to production code. Thirdly, we examined the number of assertions in test and production code. Finally, we investigated the relationship between faults and code features. In terms of results, test code did not fare well when compared with production code. An interesting and startling result related to the use of assertions: they were used liberally in test and production code. However, their effect, if triggered, was much larger in production code.
- Conference Article
12
- 10.1109/icst53961.2022.00021
- Apr 1, 2022
Flaky tests are test cases that can pass or fail without code changes. They often waste the time of software developers and obstruct the use of continuous integration. Previous work has presented several automated techniques for detecting flaky tests, though many involve repeated test executions and a lot of source code instrumentation and thus may be both intrusive and expensive. While this motivates researchers to evaluate machine learning models for detecting flaky tests, prior work on the features used to encode a test case is limited. Without further study of this topic, machine learning models cannot perform to their full potential in this domain. Previous studies also exclude a specific, yet prevalent and problematic, category of flaky tests: order-dependent (OD) flaky tests. This means that prior research only addresses part of the challenge of detecting flaky tests with machine learning. Closing this knowledge gap, this paper presents a new feature set for encoding tests, called Flake16. Using 54 distinct pipelines of data preprocessing, data balancing, and machine learning models for detecting both non-order-dependent (NOD) and OD flaky tests, this paper compares Flake16 to another well-established feature set. To assess the new feature set's effectiveness, this paper's experiments use the test suites of 26 Python projects, consisting of over 67,000 tests. Along with identifying the most impactful metrics for using machine learning to detect both types of flaky test, the empirical study shows how Flake16 is better than prior work, including (1) a 13% increase in overall F1 score when detecting NOD flaky tests and (2) a 17% increase in overall F1 score when detecting OD flaky tests.
- Research Article
12
- 10.1109/tse.2024.3472476
- Dec 1, 2024
- IEEE Transactions on Software Engineering
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. We do this for a subset of flaky tests where the root cause of flakiness is in the test itself and not in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT 3.5 Turbo, a Large Language Model (LLM), with such extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT 3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.
- Research Article
3
- 10.1109/access.2025.3553626
- Jan 1, 2025
- IEEE Access
Software development is significantly impeded by flaky tests, which intermittently pass or fail without requiring code modifications, resulting in a decline in confidence in automated testing frameworks. Code smells (i.e., test case or production code) are the primary cause of test flakiness. In order to ascertain the prevalence of test smells, researchers and practitioners have examined numerous programming languages. However, one isolated experiment was conducted, which focused solely on one programming language. Across a variety of programming languages, such as Java, Python, C++, Go, and JavaScript, this study examines the predictive accuracy of a variety of machine learning classifiers in identifying flaky tests. We compare the performance of classifiers such as Random Forest, Decision Tree, Naive Bayes, Support Vector Machine, and Logistic Regression in both single-language and cross-language settings. In order to ascertain the impact of linguistic diversity on the flakiness of test cases, models were trained on a single language and subsequently tested on a variety of languages. The following key findings indicate that Random Forest and Logistic Regression consistently outperform other classifiers in terms of accuracy, adaptability, and generalizability, particularly in cross-language environments. Additionally, the investigation contrasts our findings with those of previous research, exhibiting enhanced precision and accuracy in the identification of flaky tests as a result of meticulous classifier selection. We conducted a thorough statistical analysis, which included t-tests, to assess the importance of classifier performance differences in terms of accuracy and F1-score across a variety of programming languages. This analysis emphasizes the substantial discrepancies between classifiers and their effectiveness in detecting flaky tests. 
The datasets and experiment code used in this study are available in an open-source GitHub repository to facilitate reproducibility: https://github.com/PELAB-LiU/FlakyCrossLanguage. Our results emphasize the effectiveness of probabilistic and ensemble classifiers in improving the reliability of automated testing, despite certain constraints, including the potential biases introduced by language-specific structures and dataset variability. This research provides developers and researchers with practical insights that can be applied to the mitigation of flaky tests in a variety of software environments.
- Conference Article
23
- 10.1109/saner50967.2021.00033
- Mar 1, 2021
Software products frequently evolve. When the production code undergoes major changes such as feature addition or removal, the corresponding test code typically should co-evolve. Otherwise, the outdated test may be ineffective in revealing faults or cause spurious test failures, which could confuse developers and waste QA resources. Despite its importance, maintaining such co-evolution can be time- and resource-consuming. Existing work has disclosed that, in practice, test code often fails to co-evolve with the production code. To facilitate the co-evolution of production and test code, this work explores how to automatically identify outdated tests. To gain insights into the problem, we conducted an empirical study on 975 open-source Java projects. By manually analyzing and comparing the positive cases, where the test code co-evolves with the production code, and the negative cases, where the co-evolution is not observed, we found that various factors (e.g., the different language constructs modified in the production code) can determine whether the test code should be updated. Guided by the empirical findings, we proposed a machine-learning based approach, SITAR, that holistically considers different factors to predict test changes. We evaluated SITAR on 20 popular Java projects. These results show that SITAR, under the within-project setting, can reach an average precision and recall of 81.4% and 76.1%, respectively, for identifying test code that requires update, which significantly outperforms rule-based baseline methods. SITAR can also achieve promising results under the cross-project setting and multiclass prediction, which predicts the exact change types of test code.
- Research Article
1
- 10.1145/3759915
- Aug 12, 2025
- ACM Transactions on Software Engineering and Methodology
Log statements play a critical role in modern software development, capturing essential runtime information necessary for software maintenance. Recently, new techniques have been developed to automate logging activities, allowing log statements to be injected into code by identifying specific code locations, selecting the appropriate log level, and generating meaningful log messages that describe the behavior being logged. Although automated logging in production code has attracted significant attention, little focus has been given to the injection of logs in test code. To fill this gap, we conduct an empirical study on 5,206,759 Java test methods collected from 6,405 GitHub projects to explore and disclose the effectiveness and limitations of Pre-trained Language Models (PLMs) and Large Language Models (LLMs) for generating and injecting test log statements. Our findings demonstrate that general-purpose LLMs like GPT-3.5-Turbo, when properly instructed to inject logging statements in test methods, perform comparably to the best-performing PLMs on predicting log level. Additionally, GPT-3.5-Turbo substantially outperforms the best PLMs on predicting log position with a 33.97% improvement while also achieving superior performance in predicting log messages in terms of BLEU and ROUGE. This work takes the first step toward evaluating the capability of PLMs and LLMs to generate test log statements. To facilitate future research, we have open-sourced all data and source code used in this work.
- Research Article
4
- 10.21203/rs.3.rs-4126574/v1
- Mar 21, 2024
- Research Square
Early and accurate diagnosis is crucial for effective treatment and improved outcomes, yet identifying psychotic episodes presents significant challenges due to its complex nature and the varied presentation of symptoms among individuals. One of the primary difficulties lies in the underreporting and underdiagnosis of psychosis, compounded by the stigma surrounding mental health and the individuals’ often diminished insight into their condition. Existing efforts leveraging Electronic Health Records (EHRs) to retrospectively identify psychosis typically rely on structured data, such as medical codes and patient demographics, which frequently lack essential information. Addressing these challenges, our study leverages Natural Language Processing (NLP) algorithms to analyze psychiatric admission notes for the diagnosis of psychosis, providing a detailed evaluation of rule-based algorithms, machine learning models, and pre-trained language models. Additionally, the study investigates the effectiveness of employing keywords to streamline extensive note data before training and evaluating the models. Analyzing 4,617 initial psychiatric admission notes (1,196 cases of psychosis versus 3,433 controls) from 2005 to 2019, we discovered that the XGBoost classifier employing Term Frequency-Inverse Document Frequency (TF-IDF) features derived from notes pre-selected by expert-curated keywords, attained the highest performance with an F1 score of 0.8881 (AUROC [95% CI]: 0.9725 [0.9717, 0.9733]). BlueBERT demonstrated comparable efficacy with an F1 score of 0.8841 (AUROC [95% CI]: 0.97 [0.9580, 0.9820]) on the same set of notes. Both models markedly outperformed traditional International Classification of Diseases (ICD) code-based detection methods from discharge summaries, which had an F1 score of 0.7608, thus improving the margin by 0.12.
Furthermore, our findings indicate that keyword pre-selection markedly enhances the performance of both machine learning and pre-trained language models. This study illustrates the potential of NLP techniques to improve psychosis detection within admission notes and aims to serve as a foundational reference for future research on applying NLP for psychosis identification in EHR notes.
- Research Article
- 10.1101/2024.03.18.24304475
- Mar 19, 2024
- medRxiv
Early and accurate diagnosis is crucial for effective treatment and improved outcomes, yet identifying psychotic episodes presents significant challenges due to its complex nature and the varied presentation of symptoms among individuals. One of the primary difficulties lies in the underreporting and underdiagnosis of psychosis, compounded by the stigma surrounding mental health and the individuals’ often diminished insight into their condition. Existing efforts leveraging Electronic Health Records (EHRs) to retrospectively identify psychosis typically rely on structured data, such as medical codes and patient demographics, which frequently lack essential information. Addressing these challenges, our study leverages Natural Language Processing (NLP) algorithms to analyze psychiatric admission notes for the diagnosis of psychosis, providing a detailed evaluation of rule-based algorithms, machine learning models, and pre-trained language models. Additionally, the study investigates the effectiveness of employing keywords to streamline extensive note data before training and evaluating the models. Analyzing 4,617 initial psychiatric admission notes (1,196 cases of psychosis versus 3,433 controls) from 2005 to 2019, we discovered that the XGBoost classifier employing Term Frequency-Inverse Document Frequency (TF-IDF) features derived from notes pre-selected by expert-curated keywords, attained the highest performance with an F1 score of 0.8881 (AUROC [95% CI]: 0.9725 [0.9717, 0.9733]). BlueBERT demonstrated comparable efficacy with an F1 score of 0.8841 (AUROC [95% CI]: 0.97 [0.9580, 0.9820]) on the same set of notes. Both models markedly outperformed traditional International Classification of Diseases (ICD) code-based detection methods from discharge summaries, which had an F1 score of 0.7608, thus improving the margin by 0.12.
Furthermore, our findings indicate that keyword pre-selection markedly enhances the performance of both machine learning and pre-trained language models. This study illustrates the potential of NLP techniques to improve psychosis detection within admission notes and aims to serve as a foundational reference for future research on applying NLP for psychosis identification in EHR notes.
- Research Article
- 10.1038/s41398-025-03629-4
- Nov 29, 2025
- Translational Psychiatry
Early and accurate diagnosis is crucial for effective treatment and improved outcomes, yet identifying psychotic episodes presents significant challenges due to its complex nature and the varied presentation of symptoms among individuals. One of the primary difficulties lies in the underreporting and underdiagnosis of psychosis, compounded by the stigma surrounding mental health and the individuals’ often diminished insight into their condition. Existing efforts leveraging Electronic Health Records (EHRs) to retrospectively identify psychosis typically rely on structured data, such as medical codes and patient demographics, which frequently lack essential information. Addressing these challenges, our study leverages Natural Language Processing (NLP) algorithms to analyze psychiatric admission notes for the diagnosis of psychosis, providing a detailed evaluation of rule-based algorithms, machine learning models, and pre-trained language models. Additionally, the study investigates the effectiveness of employing keywords to streamline extensive note data before training and evaluating the models. Analyzing 4629 initial psychiatric admission notes (1196 cases of psychosis versus 3433 controls) from 2005 to 2019, including patients aged 16–35 years, selected based on the 75th percentile for age at onset of schizophrenia, we discovered that the XGBoost classifier employing Term Frequency-Inverse Document Frequency (TF-IDF) features derived from notes pre-selected by expert-curated keywords, attained the highest performance with an F1 score of 0.8881 (AUROC [95% CI]: 0.9725 [0.9717, 0.9733]). BlueBERT demonstrated comparable efficacy with an F1 score of 0.8841 (AUROC [95% CI]: 0.97 [0.9580, 0.9820]) on the same set of notes. Both models markedly outperformed traditional International Classification of Diseases (ICD) code-based detection methods from discharge summaries, which had an F1 score of 0.7608, thus improving the margin by 0.12. 
Furthermore, our findings indicate that keyword pre-selection markedly enhances the performance of both machine learning and pre-trained language models. This study illustrates the potential of NLP techniques to improve psychosis detection within admission notes and aims to serve as a foundational reference for future research on applying NLP for psychosis identification in EHR notes.
- Research Article
35
- 10.1109/access.2022.3211313
- Jan 1, 2022
- IEEE Access
Intrusion detection is an essential task for protecting the cyber environment from attacks. Many studies have proposed sophisticated models to detect intrusions from a large amount of data, yet they ignore the fact that poor data quality has a direct impact on the performance of the intrusion detection systems. This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems. Then the data preparation workflow and data quality requirements for intrusion detection are discussed. In order to investigate how data quality affects model performance, we conducted experiments on 11 HIDS datasets using eight machine learning (ML) models and two pre-trained language models (BERT and GPT-2). The experimental results show: 1. BERT and GPT-2 outperform the other models on all of the datasets. 2. The pre-trained models and the classic ML models behave differently when duplicate data and overlapped data are removed from a dataset. The pre-trained models are more capable of learning from duplicate and overlapped data compared to the classic ML models. 3. Removing overlaps and duplicates can improve the performance of the pre-trained models and the traditional ML models on most datasets used in this study. However, doing this can sometimes decrease model performance. 4. The reliability of model performance is affected when the testing data contain duplicates. 5. The overlapped rate between the normal and intrusion classes seems to have an inverse relationship to the pre-trained models’ performance on the intrusion detection task. Given the results, we discuss model selection in HIDS, and quality assurance in training data and testing data based on nine data quality dimensions.
- Conference Article
138
- 10.1145/3293882.3330570
- Jul 10, 2019
In today’s agile world, developers often rely on continuous integration pipelines to help build and validate their changes by executing tests in an efficient manner. One of the significant factors that hinder developers’ productivity is flaky tests—tests that may pass and fail with the same version of code. Since flaky test failures are not deterministically reproducible, developers often have to spend hours only to discover that the occasional failures have nothing to do with their changes. However, ignoring failures of flaky tests can be dangerous, since those failures may represent real faults in the production code. Furthermore, identifying the root cause of flakiness is tedious and cumbersome, since they are often a consequence of unexpected and non-deterministic behavior due to various factors, such as concurrency and external dependencies. As developers in a large-scale industrial setting, we first describe our experience with flaky tests by conducting a study on them. Our results show that although the number of distinct flaky tests may be low, the percentage of failing builds due to flaky tests can be substantial. To reduce the burden of flaky tests on developers, we describe our end-to-end framework that helps identify flaky tests and understand their root causes. Our framework instruments flaky tests and all relevant code to log various runtime properties, and then uses a preliminary tool, called RootFinder, to find differences in the logs of passing and failing runs. Using our framework, we collect and publicize a dataset of real-world, anonymized execution logs of flaky tests. By sharing the findings from our study, our framework and tool, and a dataset of logs, we hope to encourage more research on this important problem.
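The idea of finding differences between the logs of passing and failing runs can be illustrated with a small comparison routine. This is a minimal sketch of the general idea and not RootFinder's algorithm: it assumes each run has already been reduced to a dictionary of runtime properties, and the names `differing_properties`, `thread_count`, and `returned_null` are hypothetical.

```python
def differing_properties(passing_runs, failing_runs):
    """Return properties whose values never overlap between passing and failing runs.

    Each run is a dict mapping a runtime property name to its logged value
    (a hypothetical log representation chosen for this sketch).
    """
    def values_by_property(runs):
        observed = {}
        for run in runs:
            for name, value in run.items():
                observed.setdefault(name, set()).add(value)
        return observed

    passing = values_by_property(passing_runs)
    failing = values_by_property(failing_runs)

    result = {}
    # Only properties logged in both groups can be compared.
    for name in sorted(passing.keys() & failing.keys()):
        if passing[name].isdisjoint(failing[name]):
            result[name] = (passing[name], failing[name])
    return result


# Hypothetical logged properties from two passing runs and one failing run.
passing_runs = [
    {"thread_count": 1, "returned_null": False},
    {"thread_count": 1, "returned_null": False},
]
failing_runs = [
    {"thread_count": 1, "returned_null": True},
]
diff = differing_properties(passing_runs, failing_runs)
```

In this toy input only `returned_null` separates the two groups, so it is the single property reported as a candidate for further inspection.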
- Research Article
- 10.55041/ijsrem9725
- Jul 16, 2021
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Background & Problem Statement - Software testing is a critical phase in the software development lifecycle (SDLC), ensuring that applications function correctly, meet user requirements, and maintain high-quality standards. Traditional software testing approaches, including manual testing and rule-based automation, often face challenges in scalability, efficiency, and adaptability to dynamic software environments. Traditional testing methods are overwhelmed by complex software systems, which slows down defect detection, raises testing costs, and extends release schedules. Machine Learning (ML) has emerged as a transformative solution, introducing predictive and adaptive capabilities that optimize test case selection, automate defect detection, and enhance overall software quality assurance (QA). This study explores the integration of ML in software testing, addressing the challenges of traditional QA methodologies and demonstrating how AI-driven frameworks improve testing efficiency. Methodology - To investigate the impact of ML in software testing, this research adopts a systematic approach by analyzing ML-driven test automation techniques, including predictive testing, adaptive test execution, and automated test case generation. The research reviews how Google, Microsoft, Facebook, IBM, and Deep Code put ML-based quality assurance frameworks into operation. The study leverages supervised learning, reinforcement learning, deep learning, and NLP-based techniques to demonstrate how ML models predict software defects, dynamically adapt test cases, and optimize testing resources. The research tests how ML-based testing models operate within CI/CD pipelines to improve ongoing testing and deployment flow. Analysis & Results - The analysis of ML-driven software testing reveals that predictive analytics improves early defect detection rates. It helps developers spend 37% less time debugging their work.
Adaptive testing models, including self-healing test scripts, reduce maintenance costs by 50% and enhance test reliability in agile environments. The integration of NLP-based test case generation increases test coverage. NLP technology enables automatic connection between requirements and test cases at an 89% success rate. Additionally, reinforcement learning techniques improve test case selection, reducing redundant test executions by 43%. Our research shows that different ML methods work well to reduce incorrect error alerts. ML integration for QA increases defect prediction accuracy and optimizes test execution time. Findings & Contributions - This research contributes to the field of AI-driven software testing by providing a comprehensive framework for ML-based QA methodologies. Our study shows that machine learning helps find more software problems, adapts test cases better, and lowers testing expenses, addressing present software development needs. The study also identifies critical challenges, including data availability, model interpretability, and computational overhead, suggesting future research directions in Explainable AI (XAI), hybrid AI-ML testing models, and AI-driven security testing. As the industry moves toward AI-first software testing, this research paves the way for fully autonomous QA frameworks, enabling intelligent, scalable, and cost-effective software validation techniques. Keywords - Machine Learning, Software Testing, Quality Assurance, Predictive Testing, Adaptive Testing, Test Automation, Defect Prediction, Self-Healing Test Scripts, AI-Driven QA, Reinforcement
- Conference Article
1
- 10.1145/3609437.3609449
- Aug 4, 2023
As production code evolves, test code can quickly become outdated. When test code is outdated, it may fail to capture errors in the programs under test and can lead to serious software bugs that result in significant losses for both developers and users. To ensure high software quality, it is crucial to promptly update the test code after making changes to the production code. This practice ensures that the test code and production code evolve together, reducing the likelihood of errors and ensuring the software remains reliable. However, maintaining test code can be challenging and time-consuming. To automate the identification of outdated test code, recent research has proposed Sitar, a machine learning-based method. Despite Sitar’s usefulness, it has major limitations, including its coarse prediction granularity (at class level), reliance on naming conventions to discover test code, and dependence on manually summarized features to construct machine learning models.