Towards More Precise Coincidental Correctness Detection With Deep Semantic Learning
Coincidental correctness (CC) is a situation in which, during the execution of a test case, the buggy entity is executed but the program still behaves correctly as expected. Many automated fault localization (FL) techniques use runtime information to discover the underlying connection between the executed buggy entity and the failing test result. The existence of CC weakens this connection, misleads FL algorithms into building inaccurate models, and consequently decreases localization accuracy. To alleviate the adverse effect of CC on FL, CC detection techniques have been proposed to identify possible CC tests via heuristic or machine learning algorithms. However, their precision is not satisfactory, since they overestimate the set of possible CC tests and are insufficient in learning deep semantic features. In this work, we propose a novel <u>Tri</u>plet network-based <u>Co</u>incidental <u>Co</u>rrectness detection technique (<i>i.e.,</i> <b>TriCoCo</b>) to overcome the limitations of prior work. <b>TriCoCo</b> narrows the possible CC tests by designing three features to identify genuine passing tests. Instead of using all tests as inputs, as existing techniques do, <b>TriCoCo</b> takes the identified genuine passing tests and the failing ones to train a triplet model that can evaluate their relative distance. Finally, <b>TriCoCo</b> uses the trained triplet model to infer, for each of the remaining passing tests, the probability that it is a CC test. We conduct large-scale experiments to evaluate <b>TriCoCo</b> on the widely used Defects4J benchmark. 
The results demonstrate that <b>TriCoCo</b> can improve not only the precision of CC detection but also the effectiveness of FL techniques, <i>e.g.,</i> the precision of <b>TriCoCo</b> is 80.33% on average, and <b>TriCoCo</b> boosts the efficacy of DStar by 18%–74% in terms of the MFR metric when compared to seven state-of-the-art CC detection baselines.
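As a rough illustration of the triplet idea mentioned in the abstract above, the sketch below computes a standard margin-based triplet loss over embedding vectors. The Euclidean distance, the margin value, the toy vectors, and the function names are all assumptions made for illustration; the abstract does not specify the loss or architecture that TriCoCo actually uses.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: push the anchor closer to the positive
    than to the negative by at least `margin` (hypothetical setting)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D embeddings of test executions.
anchor = [1.0, 0.0]    # e.g. a genuine passing test
positive = [1.0, 1.0]  # another test of the same class
negative = [4.0, 4.0]  # a test of the other class (e.g. failing)

well_separated = triplet_loss(anchor, positive, negative)
badly_separated = triplet_loss(anchor, negative, positive)
```

The loss is zero once the anchor is closer to the positive than to the negative by the margin, and it grows when the two are swapped, which is what lets a trained model express a relative distance between tests.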
- Conference Article
17
- 10.1109/iccis.2012.361
- Aug 1, 2012
In order to improve the efficiency of debugging, many fault localization techniques have been proposed to find the program entities that are likely to contain faults. However, recent research indicates that the effectiveness of fault localization techniques suffers from occurrences of coincidental correctness, in which the execution results of test cases that exercise faulty statements indicate no failure. This paper presents a strategy using cluster analysis to identify coincidental correctness in test sets for fault localization. Test cases that exercise the same faulty statements are expected to be grouped together by cluster analysis, and during debugging the tests identified as coincidentally correct can then be used to improve the effectiveness of fault localization techniques. To evaluate our technique, we conducted an experiment on some Siemens suite programs. The experimental results show that the strategy is effective at automatically identifying coincidentally correct tests.
- Research Article
- 10.1002/spe.2104
- Jan 16, 2012
- Software: Practice and Experience
Software systems today are large and complex. At the same time, the time to market is extremely short because of competition. As a result, program debugging for real-life systems is very difficult. In general, the debugging process consists of three tasks, namely, fault localization, fault repair, and retesting. In particular, fault localization is generally considered to be the most challenging. It is recognized as time-consuming and tedious if conducted manually. On the other hand, formal methods suffer from scalability problems, and static techniques are imprecise. Automatic statistical fault localization techniques are regarded as the most promising option. They compare passed and failed executions of a faulty program and produce a suspiciousness ranking of program entities (such as statements or predicates). Developers may then follow up with the list sequentially to identify program faults. Unfortunately, although a large number of statistical fault localization techniques are available, they have not reached the maturity to pinpoint accurately and precisely the locations of faults. Also, the recording and replaying of passed and failed executions as well as fault repair without introducing new bugs remain unresolved issues. Furthermore, researchers often make unrealistic assumptions, and software subjects under study do not necessarily reflect the fault characteristics of large industrial applications. There is plenty of room for improvement. The 2nd International Workshop on Program Debugging (IWPD 2011) was a full day workshop held in conjunction with the 35th Annual International Computer Software and Applications Conference (COMPSAC 2011) in Munich, Germany in July 2011. It serves as a platform for researchers and practitioners to exchange ideas, present new advancements, and identify further challenges in program debugging. 
It brings to light the latest challenges and advances in research and practice in program debugging, with a special emphasis on methodology, technology, and environment. Two keynote speeches were given by internationally renowned researchers — T. Y. Chen of Swinburne University of Technology, Australia and W. K. Chan of City University of Hong Kong, Hong Kong. There were also sessions for paper presentations and panel discussions. We shortlisted three papers from the workshop and invited the authors to submit an extended version to Software: Practice and Experience. Two papers were accepted for this focus section after going through two rounds of rigorous reviews involving two to three anonymous reviewers for each article. Both accepted papers address the important area of statistical fault location. The first paper, entitled ‘In quest of the science in statistical fault localization’ by W. K. Chan and Yan Cai, is an extended version of the keynote speech delivered by the first author in IWPD 2011. A vital element in research is to know the shortcomings of the current state of the art. In this paper, the authors conduct a critical review of existing work on statistical fault localization (including their own), highlight misconceptions and unnecessary assumptions, and provide remedial measures to rectify such malpractices. The authors point out that a lot of current research in statistical fault localization does not consider coincidental correctness, which means that the execution of a faulty statement may not necessarily lead to a program failure, even though this important concept has been known to software testers for decades. Also, existing fault localization techniques compare the similarities and dissimilarities between passed and failed executions to locate faults. These similarity coefficients estimate the probability that a particular program entity causes a failure, but ignore the noise caused by other entities. 
The authors point out the importance of a noise-reduction mechanism for the similarity coefficients. Another issue is that existing researchers often assume that they are dealing with large samples, where the central limit theorem applies. Empirical studies by the authors show that this assumption is often invalid. It is unrealistic to expect the availability of execution profiles with thousands of test verdicts for the average programs. A developer needs to debug a program even if a small number of failures have been revealed. When the number of samples is small, nonparametric statistical techniques should be applied. The authors conclude the paper by giving an insightful summary of the challenges in statistical fault localization that may benefit researchers in software engineering and related software areas. The second paper is entitled ‘A consensus-based strategy to improve the quality of fault localization’ by Vidroha Debroy and W. Eric Wong. Quite a number of statistical fault localization techniques have been proposed. Each of them claims to be superior to others in one aspect or another using different data sets. There is, however, no single technique that is definitely better than others in all aspects. In this paper, the authors put forward an integrated approach to address the issue. Rather than proposing yet another new technique that captures the more promising features of existing techniques, the authors propose a consensus-based strategy, which combines the rankings of several techniques. Using the Borda method, a consolidated ranking is produced by integrating various statement rankings that result from individual techniques. The scale of the proposed approach can be easily extended or retracted because new fault localization techniques can be added by the inclusion of their rankings, or existing techniques can be excluded by the removal of their rankings. 
Also, because different techniques operate on the same input data set, the overhead of the consensus is minimal. The overall ranking can be determined in linear time. The effectiveness of the consensus-based approach has been validated using three popular fault localization techniques (Tarantula, Ochiai, and H3) on the Siemens suite of programs as well as the Ant, grep, gzip, make, and space programs. The empirical study shows that the performance of the proposed approach is close to the best results of the techniques under study. Finally, I would like to thank Professor Nigel Horspool and Professor Andy Wellings, Editors of Software: Practice and Experience, for kindly agreeing to publish this focus section.
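The Borda-based consensus described in this editorial can be sketched as follows. This is a minimal illustration only: the scoring variant (n points for first place down to 1 for last), the tie-breaking rule, and the three example rankings are assumptions, not details taken from the Debroy and Wong paper.

```python
def borda_consensus(rankings):
    """Combine several rankings of the same statements with a Borda count.

    Each ranking lists statement ids from most to least suspicious.
    A statement at 0-based position p in a ranking of n items earns n - p points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, stmt in enumerate(ranking):
            scores[stmt] = scores.get(stmt, 0) + (n - position)
    # Highest total first; ties broken by statement id (an arbitrary, assumed rule).
    return sorted(scores, key=lambda s: (-scores[s], s))

# Hypothetical rankings produced by three techniques on four statements.
tarantula = ["s3", "s1", "s2", "s4"]
ochiai = ["s1", "s3", "s4", "s2"]
h3 = ["s3", "s2", "s1", "s4"]

consensus = borda_consensus([tarantula, ochiai, h3])
```

Adding or removing a technique only means adding or removing one list from the input, which matches the extensibility argument made in the text. Note that this sketch sorts the totals for convenience, so it is not itself a demonstration of the linear-time claim.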
- Research Article
14
- 10.1002/stvr.1762
- Jan 9, 2021
- Software Testing, Verification and Reliability
According to the reachability–infection–propagation (RIP) model, three conditions must be satisfied for program failure to occur: (1) the defect's location must be reached, (2) the program's state must become infected and (3) the infection must propagate to the output. Weak coincidental correctness (or weak CC) occurs when the program produces the correct output, while condition (1) is satisfied but conditions (2) and (3) are not satisfied. Strong coincidental correctness (or strong CC) occurs when the output is correct, while both conditions (1) and (2) are satisfied but not (3). The prevalence of CC was previously recognized. In addition, the potential for its negative effect on spectrum‐based fault localization (SBFL) was analytically demonstrated; however, this was not empirically validated. Using Defects4J, this paper empirically studies the impact of weak and strong CC on three well‐researched coverage‐based fault detection and localization techniques, namely, test suite reduction (TSR), test case prioritization (TCP) and SBFL. Our study, which involved 52 SBFL metrics, provides the following empirical evidence. (i) The negative impact of CC tests on TSR and TCP is very significant. In addition, cleansing the CC tests was observed to yield (a) a 100% TSR defect detection rate for all subject programs and (b) an improvement of TCP for over 92% of the subjects. (ii) The impact of CC tests on SBFL varies widely w.r.t. the metric used. The negative impact was strong for 11 metrics, mild for 37, non‐measurable for 1 and non‐existent for 3 metrics. Interestingly, the negative impact was mild for the 9 most popular and/or most effective SBFL metrics. In addition, cleansing the CC tests resulted in the deterioration of SBFL for a considerable number of subject programs. (iii) Increasing the proportion of CC tests has a limited impact on TSR, TCP and SBFL. Interestingly, for TSR and TCP and 11 SBFL metrics, small and large proportions of CC tests are strongly harmful. 
(iv) Lastly, weak and strong CC are equally detrimental in the context of TSR, TCP and SBFL.
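To make the weak/strong distinction above concrete, here is a small toy program with a seeded fault and a classifier that follows the RIP conditions. The program, the fault, and all names are hypothetical and chosen only to make each case easy to trigger; they do not come from the paper.

```python
def run(x, faulty):
    """Toy program: returns (reached, internal_state, output).

    Intended behavior: for x >= 0, compute y = x * 2 and report whether y > 10.
    Seeded fault (hypothetical): '+' is used instead of '*'.
    """
    if x < 0:
        return False, None, False          # faulty statement not reached
    y = x + 2 if faulty else x * 2         # the faulty statement
    return True, y, y > 10

def classify(x):
    """Label input x as failing, genuine pass, weak CC, or strong CC."""
    reached, y_bug, out_bug = run(x, faulty=True)
    _, y_ok, out_ok = run(x, faulty=False)
    if out_bug != out_ok:
        return "failing"        # reached, infected, and propagated
    if not reached:
        return "genuine pass"   # condition (1) not satisfied
    if y_bug == y_ok:
        return "weak CC"        # reached, but state not infected
    return "strong CC"          # reached and infected, but not propagated

labels = {x: classify(x) for x in (-1, 2, 10, 6)}
```

With this fault, x = 2 reaches the faulty statement but produces the same state (4 either way), x = 10 produces an infected state (12 instead of 20) that is masked by the comparison, and x = 6 is the only input of the four that actually fails.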
- Conference Article
15
- 10.1109/saner53432.2022.00045
- Mar 1, 2022
Automated fault localization (FL) techniques collect runtime information as input data and then analyze the input data to identify the relationship between program statements and failures. They usually take advantage of the statistics of the input data to develop a suspiciousness evaluation methodology (e.g., spectrum-based formulas and deep neural network models) by exploring the underlying correlation rooted in the input data. Thus, the quality of input data is critical for FL. In the actual process of development, developers seek to generate adequate test cases for testing the function or the robustness of a subject program. However, for a given fault, most test cases pass and very few fail, since only a very small portion of inputs in the input domain will lead to a program failure. This means that FL usually faces a problem of imbalanced data, and this problem has been proven to have an adverse effect on FL effectiveness. To address this problem, we propose BCL-FL: a data augmentation approach based on between-class learning, which produces new synthesized failed test samples by mixing two classes of real test cases (i.e., a passed test case and a failed one) with a random ratio. Specifically, BCL-FL uses the characteristics of real failed test cases to design a data synthesis formula suitable for failed test samples, which can make the synthesized failed test samples closer to real test cases. Since the synthesized data is different from real data, we ingeniously assign a continuous value between 0 and 1 to label the synthesized sample according to the mixing ratio of original labels. We take the synthesized failed test samples and the original test cases as the balanced input data for FL techniques to address the imbalanced data problem. 
To evaluate the effectiveness of BCL-FL, we conduct large-scale experiments on 287 faulty versions of eight large-sized programs (from ManyBugs and Defects4J) using six state-of-the-art FL approaches. The experimental results show that BCL-FL significantly improves the effectiveness of existing FL techniques, e.g., BCL-FL improves the CNN-FL approach in Top-1, Top-5, and Top-10 by 150%, 136.36%, and 193.1%, respectively.
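The mixing step described in this abstract can be sketched roughly as below. This is a generic between-class linear mix, not BCL-FL's own synthesis formula, which the abstract does not give. The label convention (failed = 1, passed = 0), the coverage vectors, and the function name are assumptions for illustration.

```python
import random

def mix(passed, failed, ratio):
    """Linearly mix a passed and a failed coverage vector.

    `ratio` is the weight of the failed test. The label is mixed with the
    same ratio, assuming failed = 1.0 and passed = 0.0 (an assumed convention).
    """
    sample = [ratio * f + (1 - ratio) * p for p, f in zip(passed, failed)]
    label = ratio * 1.0 + (1 - ratio) * 0.0
    return sample, label

# Hypothetical statement-coverage vectors (1 = statement executed).
passed_cov = [1, 0, 1, 0]
failed_cov = [1, 1, 0, 0]

# A fixed ratio for a reproducible example ...
fixed_sample, fixed_label = mix(passed_cov, failed_cov, 0.75)

# ... and a randomly drawn ratio, as the abstract describes.
rng = random.Random(0)
random_sample, random_label = mix(passed_cov, failed_cov, rng.random())
```

The synthesized samples would then be added to the original tests to rebalance the input data; how the continuous labels are consumed depends on the downstream FL technique.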
- Research Article
- 10.1109/tr.2026.3668421
- Jan 1, 2026
- IEEE Transactions on Reliability
Coincidental Correctness (CC) arises when a test case executes a faulty entity in a program without causing a failure. This phenomenon injects noise into coverage information, as CC tests weaken the connection between faulty entities and test failures. Since many fault localization (FL) approaches rely on analyzing test execution traces to locate faulty entities, the compromised reliability of test results directly undermines FL accuracy. Furthermore, the detrimental effects of CC extend beyond fault localization to subsequent software maintenance tasks like automatic program repair. Therefore, identifying and mitigating CC tests becomes critical not only for enhancing FL but also for ensuring robust software quality assurance. Thus, we propose FusionCC: an approach that applies multiscale coverage features and handcrafted features to fuse complementary feature representations for CC test case detection. Specifically, FusionCC first refines original coverage data by filtering out noisy irrelevant elements, then extracts multiscale features from the refined matrix, and finally fuses the coverage and handcrafted features to generate highly informative feature representations for CC detection. FusionCC realizes a comprehensive fusion of complementary features across different scales and from diverse sources, which significantly enhances the accuracy of CC detection. To evaluate the effectiveness of FusionCC, we conduct large-scale experiments on 277 faulty versions of six representative benchmarks. 
The experimental results show that FusionCC significantly improves CC detection (e.g., average improvements of 50.93% in precision and 82.03% in F1 value compared to state-of-the-art CC detection approaches) and fault localization effectiveness (e.g., 10.33, 19.33, 25.67 average faults can be found in terms of Top-1, Top-3, Top-5 metrics under the relabel strategy compared with state-of-the-art FL approaches).
- Research Article
8
- 10.1016/j.jss.2023.111900
- Nov 18, 2023
- Journal of Systems and Software
Trace matrix optimization for fault localization
- Conference Article
136
- 10.1109/icse.2009.5070507
- Jan 1, 2009
Recent techniques for fault localization leverage code coverage to address the high cost problem of debugging. These techniques exploit the correlations between program failures and the coverage of program entities as the clue in locating faults. Experimental evidence shows that the effectiveness of these techniques can be affected adversely by coincidental correctness, which occurs when a fault is executed but no failure is detected. In this paper, we propose an approach to address this problem. We refine code coverage of test runs using control- and data-flow patterns prescribed by different fault types. We conjecture that this extra information, which we call context patterns, can strengthen the correlations between program failures and the coverage of faulty program entities, making it easier for fault localization techniques to locate the faults. To evaluate the proposed approach, we have conducted a mutation analysis on three real world programs and cross-validated the results with real faults. The experimental results consistently show that coverage refinement is effective in easing the coincidental correctness problem in fault localization techniques.
- Conference Article
4
- 10.1109/issrew.2013.6688889
- Nov 1, 2013
In software debugging, statistical fault localization techniques contrast dynamic spectra of program elements to estimate the location of faults in faulty programs. Coincidental correctness may have a negative impact on these techniques because faults can also be triggered in an observed non-failed run and thus disturb the assessment of fault locations. However, eliminating this confounding effect relies on accurately recognizing such runs. This paper makes use of the presence of coincidental correctness as an effective interface to the success of fault localization. We calculate the distribution overlap of the dynamic spectrum in failed runs and in non-failed runs to find the fault-leading predicates, and further reduce the region by referencing the inter-class distances of the spectra to suppress the less suspicious candidates. Empirical results show that our technique can outperform representative existing predicate-based fault localization techniques.
- Conference Article
6
- 10.1109/saner56733.2023.00018
- Mar 1, 2023
Fault localization seeks to locate the suspicious statements that may be responsible for a program failure. Experimental evidence shows that fault localization effectiveness is adversely affected by the existence of coincidental correctness (CC) test cases, where a CC test case is one that executes a fault but does not fail. Even worse, CC test cases are prevalent in realistic testing and debugging, which severely harms fault localization effectiveness. Thus, it is indispensable to accurately detect CC test cases and alleviate their harmful effect on fault localization effectiveness. To address this problem, we propose NeuralCCD: a neural coincidental correctness detection approach that integrates multiple features. Specifically, NeuralCCD first leverages suspiciousness score, coverage ratio and similarity to define three CC detection features. Based on these CC detection features and CC labels, NeuralCCD utilizes a multi-layer perceptron to learn a different feature-based model for each program, and finally combines the trained models of different programs into an ensemble system to detect CC test cases. To evaluate the effectiveness of NeuralCCD, we conduct large-scale experiments on 247 faulty versions of five representative benchmarks and compare NeuralCCD with four state-of-the-art CC detection approaches. The experimental results show that NeuralCCD significantly improves the effectiveness of CC detection, e.g., NeuralCCD yields improvements of at most 109.5%, 93% and 81.3% in Top-1, Top-3 and Top-5 over Tech-I when utilized with the DStar formula.
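Several abstracts on this page evaluate CC detection through the DStar formula, so a small sketch of how CC tests depress a faulty statement's score may help. It uses the commonly cited definition of DStar, ef^star / (ep + nf), with star = 2. The test counts are invented for illustration, and "cleansing" and "relabeling" are shown only in their simplest form (dropping CC tests, or counting them as failing).

```python
def dstar(ef, ep, nf, star=2):
    """DStar suspiciousness of one statement.

    ef: failing tests that execute the statement
    ep: passing tests that execute the statement
    nf: failing tests that do not execute the statement
    """
    denominator = ep + nf
    if denominator == 0:
        # Assumed convention for the undefined case.
        return float("inf") if ef > 0 else 0.0
    return ef ** star / denominator

# Hypothetical faulty statement: executed by 2 failing tests and by
# 4 passing tests, 3 of which are actually coincidentally correct.
original = dstar(ef=2, ep=4, nf=0)

# Cleansing: drop the 3 CC tests from the passing set.
cleansed = dstar(ef=2, ep=1, nf=0)

# Relabeling: count the 3 CC tests as failing instead.
relabeled = dstar(ef=5, ep=1, nf=0)
```

In this toy setting the faulty statement's score rises from 1.0 to 4.0 after cleansing and to 25.0 after relabeling. Scores of other statements change too, so the effect on the final ranking depends on the whole spectrum.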
- Research Article
24
- 10.1016/j.infsof.2018.11.009
- Nov 30, 2018
- Information and Software Technology
VFL: Variable-based fault localization
- Research Article
54
- 10.1007/s11219-016-9312-z
- Mar 26, 2016
- Software Quality Journal
Automated program repair (APR) tools apply fault localization (FL) techniques to identify the locations of likely faults to be repaired. The effectiveness, performance, and repair correctness of APR depend in part on the FL method used. If FL does not identify the location of a fault, the application of an APR tool will not be effective: it will fail to repair the fault. If FL assigns the actual faulty statement a low priority for repair, APR performance will be reduced by increasing the time required to find a potential repair. In addition, the correctness of a generated repair will be decreased since APR will modify fault-free statements that are assigned a higher priority for repair than an actual faulty statement. We conducted a controlled experiment to evaluate the impact of ten FL techniques on APR effectiveness, performance, and repair correctness using a brute force APR tool applied to faulty versions of the Siemens Suite and two other large programs: space and sed. All FL techniques were effective in identifying all faults; however, Wong3 and Ample1 were the least effective FL techniques since they assigned the lowest priority for repair in more than 26% of the trials. APR performance was worst, significantly so, when Ample1 was used, since it generated a large number of variants in 29.11% of the trials and took the longest time to produce potential repairs. Jaccard FL improved repair correctness by generating more validated repairs (potential repairs that pass a set of regression tests) and by generating potential repairs that failed fewer regression tests. Also, Jaccard's performance is noteworthy in that it never generated a large number of variants during the repair process compared to the alternatives.
- Conference Article
71
- 10.1109/qsic.2010.80
- Jul 1, 2010
Fault localization is one of the most expensive activities of program debugging, which is why recent years have witnessed the development of many different fault localization techniques. This paper proposes a grouping-based strategy that can be applied to various techniques in order to boost their fault localization effectiveness. The applicability of the strategy is assessed on Tarantula and a radial basis function neural network-based technique, across three different sets of programs (the Siemens suite, grep and gzip). The results suggest that the grouping-based strategy is capable of significantly improving fault localization effectiveness and is not limited to any particular fault localization technique. The proposed strategy does not require any information beyond what was already collected as input to the fault localization technique, and does not require the technique to be modified in any way.
- Research Article
50
- 10.1145/3345628
- Oct 9, 2019
- ACM Transactions on Software Engineering and Methodology
Finding the root cause of a bug requires a significant effort from developers. Automated fault localization techniques seek to reduce this cost by computing the suspiciousness scores (i.e., the likelihood of program entities being faulty). Existing techniques have been developed by utilizing input features of specific types for the computation of suspiciousness scores, such as program spectrum or mutation analysis results. This article presents a novel learn-to-rank fault localization technique called PRecise machINe-learning-based fault loCalization tEchnique (PRINCE). PRINCE uses genetic programming (GP) to combine multiple sets of localization input features that have been studied separately until now. For dynamic features, PRINCE encompasses both Spectrum Based Fault Localization (SBFL) and Mutation Based Fault Localization (MBFL) techniques. It also uses static features, such as dependency information and structural complexity of program entities. All such information is used by GP to train a ranking model for fault localization. The empirical evaluation on 65 real-world faults from CoREBench, 84 artificial faults from SIR, and 310 real-world faults from Defects4J shows that PRINCE outperforms the state-of-the-art SBFL, MBFL, and learn-to-rank techniques significantly. PRINCE localizes a fault after reviewing 2.4% of the executed statements on average (4.2 and 3.0 times more precise than the best of the compared SBFL and MBFL techniques, respectively). Also, PRINCE ranks 52.9% of the target faults within the top ten suspicious statements.
- Conference Article
3
- 10.1109/issre59848.2023.00074
- Oct 9, 2023
A test suite is indispensable for fault localization because it provides useful execution information from its test cases for locating statements suspected of being faulty. There exists a type of test case known as a coincidental correctness (CC) test case, which executes the faulty statement yet produces the anticipated output. Existing studies have shown that CC test cases harm fault localization effectiveness. Therefore, it is crucial to detect CC test cases to mitigate their adverse impact on fault localization. To address this issue, we propose ContraCC: a CC test case detection method using contrastive learning. The insight of ContraCC is that the internal structural information of source test case execution data should be beneficial for CC detection, yet suitable representation methods are lacking. Inspired by this insight, ContraCC uses contrastive learning to learn new differentiated representations as test case vectors, which differentiate between similar and dissimilar pairs of test cases by maximizing their similarity within the same class and minimizing it between different classes. Based on the contrastive learning representations (i.e., test case vectors), ContraCC adopts a multi-layer perceptron for binary classification to detect CC in downstream tasks. To evaluate the effectiveness of ContraCC, we conduct large-scale experiments on widely-used benchmarks by comparing ContraCC with five state-of-the-art CC test case detection methods and applying ContraCC for fault localization. The experimental results show that ContraCC outperforms four state-of-the-art methods (e.g., from 10% to 84% improvement in Top-N on the best-performing baseline NeuralCCD) and significantly improves fault localization effectiveness (e.g., 24% improvement on the best-performing baseline Dstar).
- Book Chapter
15
- 10.1002/9781119880929.ch1
- Apr 20, 2023
This chapter describes traditional and intuitive fault localization techniques, including program logging, assertions, breakpoints, and profiling. Many advanced fault localization techniques have surfaced recently using the idea of causality, which is related to philosophical theories with an objective to characterize the relationship between events/causes and a phenomenon/effect. The chapter aims to classify fault localization techniques into nine categories, including slicing-based, spectrum-based, statistics-based, machine learning-based, data mining-based, IR-based, model-based, spreadsheet-based techniques, and additional emerging techniques. It lists some of the popular subject programs that have been used in different case studies and discusses how these programs have evolved through the years. The chapter describes different evaluation metrics to assess the effectiveness of fault localization techniques. One challenge for many empirical studies on software fault localization is that they require appropriate tool support for automatic or semiautomatic data collection and suspiciousness computation. The chapter also presents an overview on the key concepts discussed in this book.