Representation learning for coincidental correctness in fault localization
- Research Article
9
- 10.1002/stvr.1762
- Jan 9, 2021
- Software Testing, Verification and Reliability
According to the reachability–infection–propagation (RIP) model, three conditions must be satisfied for program failure to occur: (1) the defect's location must be reached, (2) the program's state must become infected and (3) the infection must propagate to the output. Weak coincidental correctness (or weak CC) occurs when the program produces the correct output, while condition (1) is satisfied but conditions (2) and (3) are not satisfied. Strong coincidental correctness (or strong CC) occurs when the output is correct, while both conditions (1) and (2) are satisfied but not (3). The prevalence of CC was previously recognized. In addition, the potential for its negative effect on spectrum-based fault localization (SBFL) was analytically demonstrated; however, this was not empirically validated. Using Defects4J, this paper empirically studies the impact of weak and strong CC on three well-researched coverage-based fault detection and localization techniques, namely, test suite reduction (TSR), test case prioritization (TCP) and SBFL. Our study, which involved 52 SBFL metrics, provides the following empirical evidence. (i) The negative impact of CC tests on TSR and TCP is very significant. In addition, cleansing the CC tests was observed to yield (a) a 100% TSR defect detection rate for all subject programs and (b) an improvement of TCP for over 92% of the subjects. (ii) The impact of CC tests on SBFL varies widely w.r.t. the metric used. The negative impact was strong for 11 metrics, mild for 37, non-measurable for 1 and non-existent for 3 metrics. Interestingly, the negative impact was mild for the 9 most popular and/or most effective SBFL metrics. In addition, cleansing the CC tests resulted in the deterioration of SBFL for a considerable number of subject programs. (iii) Increasing the proportion of CC tests has a limited impact on TSR, TCP and SBFL. Interestingly, for TSR and TCP and 11 SBFL metrics, small and large proportions of CC tests are strongly harmful.
(iv) Lastly, weak and strong CC are equally detrimental in the context of TSR, TCP and SBFL.
- Research Article
- 10.1002/spe.2104
- Jan 16, 2012
- Software: Practice and Experience
Software systems today are large and complex. At the same time, the time to market is extremely short because of competition. As a result, program debugging for real-life systems is very difficult. In general, the debugging process consists of three tasks, namely, fault localization, fault repair, and retesting. In particular, fault localization is generally considered to be the most challenging. It is recognized as time-consuming and tedious if conducted manually. On the other hand, formal methods suffer from scalability problems, and static techniques are imprecise. Automatic statistical fault localization techniques are regarded as the most promising option. They compare passed and failed executions of a faulty program and produce a suspiciousness ranking of program entities (such as statements or predicates). Developers may then follow the list sequentially to identify program faults. Unfortunately, although a large number of statistical fault localization techniques are available, they have not reached the maturity to pinpoint the locations of faults accurately and precisely. Also, the recording and replaying of passed and failed executions, as well as fault repair without introducing new bugs, remain unresolved issues. Furthermore, researchers often make unrealistic assumptions, and the software subjects under study do not necessarily reflect the fault characteristics of large industrial applications. There is plenty of room for improvement. The 2nd International Workshop on Program Debugging (IWPD 2011) was a full-day workshop held in conjunction with the 35th Annual International Computer Software and Applications Conference (COMPSAC 2011) in Munich, Germany in July 2011. It served as a platform for researchers and practitioners to exchange ideas, present new advancements, and identify further challenges in program debugging.
It brought to light the latest challenges and advances in research and practice in program debugging, with a special emphasis on methodology, technology, and environment. Two keynote speeches were given by internationally renowned researchers: T. Y. Chen of Swinburne University of Technology, Australia and W. K. Chan of City University of Hong Kong, Hong Kong. There were also sessions for paper presentations and panel discussions. We shortlisted three papers from the workshop and invited the authors to submit an extended version to Software: Practice and Experience. Two papers were accepted for this focus section after going through two rounds of rigorous reviews involving two to three anonymous reviewers for each article. Both accepted papers address the important area of statistical fault localization. The first paper, entitled ‘In quest of the science in statistical fault localization’ by W. K. Chan and Yan Cai, is an extended version of the keynote speech delivered by the first author in IWPD 2011. A vital element in research is to know the shortcomings of the current state of the art. In this paper, the authors conduct a critical review of existing work on statistical fault localization (including their own), highlight misconceptions and unnecessary assumptions, and provide remedial measures to rectify such malpractices. The authors point out that much current research in statistical fault localization does not consider coincidental correctness, which means that the execution of a faulty statement may not necessarily lead to a program failure, even though this important concept has been known to software testers for decades. Also, existing fault localization techniques compare the similarities and dissimilarities between passed and failed executions to locate faults. These similarity coefficients estimate the probability that a particular program entity causes a failure, but ignore the noise caused by other entities.
The authors point out the importance of a noise-reduction mechanism for the similarity coefficients. Another issue is that existing researchers often assume that they are dealing with large samples, where the central limit theorem applies. Empirical studies by the authors show that this assumption is often invalid. It is unrealistic to expect the availability of execution profiles with thousands of test verdicts for average programs. A developer needs to debug a program even if only a small number of failures have been revealed. When the number of samples is small, nonparametric statistical techniques should be applied. The authors conclude the paper by giving an insightful summary of the challenges in statistical fault localization that may benefit researchers in software engineering and related software areas. The second paper is entitled ‘A consensus-based strategy to improve the quality of fault localization’ by Vidroha Debroy and W. Eric Wong. Quite a number of statistical fault localization techniques have been proposed. Each of them claims to be superior to the others in one aspect or another using different data sets. There is, however, no single technique that is definitely better than the others in all aspects. In this paper, the authors put forward an integrated approach to address the issue. Rather than proposing yet another new technique that captures the more promising features of existing techniques, the authors propose a consensus-based strategy, which combines the rankings of several techniques. Using the Borda method, a consolidated ranking is produced by integrating the various statement rankings that result from individual techniques. The scope of the proposed approach can easily be extended or reduced: new fault localization techniques can be incorporated by including their rankings, and existing techniques can be excluded by removing theirs.
Also, because different techniques operate on the same input data set, the overhead of the consensus is minimal. The overall ranking can be determined in linear time. The effectiveness of the consensus-based approach has been validated using three popular fault localization techniques (Tarantula, Ochiai, and H3) on the Siemens suite of programs as well as the Ant, grep, gzip, make, and space programs. The empirical study shows that the performance of the proposed approach is close to the best results of the techniques under study. Finally, I would like to thank Professor Nigel Horspool and Professor Andy Wellings, Editors of Software: Practice and Experience, for kindly agreeing to publish this focus section.
- Research Article
6
- 10.1016/j.jss.2023.111900
- Nov 18, 2023
- Journal of Systems and Software
Trace matrix optimization for fault localization
- Conference Article
3
- 10.1109/saner56733.2023.00018
- Mar 1, 2023
Fault localization seeks to locate the suspicious statements possibly responsible for a program failure. Experimental evidence shows that fault localization effectiveness is adversely affected by the existence of coincidental correctness (CC) test cases, where a CC test case is one that executes a fault without causing a failure. Even worse, CC test cases are prevalent in realistic testing and debugging, severely degrading fault localization effectiveness. Thus, it is indispensable to accurately detect CC test cases and alleviate their harmful effect on fault localization effectiveness. To address this problem, we propose NeuralCCD: a neural coincidental correctness detection approach that integrates multiple features. Specifically, NeuralCCD first leverages suspiciousness score, coverage ratio and similarity to define three CC detection features. Based on these CC detection features and CC labels, NeuralCCD utilizes a multi-layer perceptron to learn a feature-based model for each program, and finally combines the trained models of the different programs into an ensemble system to detect CC test cases. To evaluate the effectiveness of NeuralCCD, we conduct large-scale experiments on 247 faulty versions of five representative benchmarks and compare NeuralCCD with four state-of-the-art CC detection approaches. The experimental results show that NeuralCCD significantly improves the effectiveness of CC detection; e.g., NeuralCCD yields improvements of up to 109.5%, 93% and 81.3% in Top-1, Top-3 and Top-5 over Tech-I when used with the DStar formula.
- Research Article
16
- 10.1142/s0218194013500186
- Jun 1, 2013
- International Journal of Software Engineering and Knowledge Engineering
Coverage-based fault localization techniques leverage coverage information to identify the faulty elements of a program. However, these techniques can be adversely affected by coincidental correctness, which occurs when the defect is executed but no failure is revealed. In this paper, we propose a clustering-based strategy to identify coincidental correctness in fault localization. The insight behind this strategy is that tests in the same cluster have similar behaviors. Thus, a passed test in a cluster with many failed tests is highly likely to be coincidentally correct, because it has the potential to execute the faulty elements just as those failed tests do. We evaluated this technique from two aspects: its ability to identify coincidental correctness and its effectiveness in improving fault localization. The experimental results show that our strategy can alleviate the coincidental correctness problem and improve the effectiveness of fault localization.
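The cluster-then-flag insight above can be sketched minimally. As a stand-in for a real clustering algorithm, this hypothetical helper clusters tests by identical coverage vectors and flags any passed test whose cluster is dominated by failures; the names, data layout and threshold are assumptions for illustration:

```python
from collections import defaultdict

def flag_coincidental(tests, threshold=0.5):
    """Flag passed tests that cluster with mostly failed tests.

    `tests` maps a test id to (coverage vector as a tuple, verdict
    'pass'/'fail').  Tests sharing a coverage vector form one cluster
    (a degenerate clustering, standing in for a real algorithm);
    a passed test in a cluster whose failure ratio exceeds
    `threshold` is flagged as likely coincidentally correct.
    """
    clusters = defaultdict(list)
    for tid, (cov, verdict) in tests.items():
        clusters[cov].append((tid, verdict))
    flagged = []
    for members in clusters.values():
        fails = sum(1 for _, v in members if v == "fail")
        if fails / len(members) > threshold:
            flagged.extend(t for t, v in members if v == "pass")
    return sorted(flagged)

# t1 passes but shares its coverage with two failing tests, so it is
# flagged as a likely coincidentally correct test:
suite = {
    "t1": ((1, 0, 1), "pass"),
    "t2": ((1, 0, 1), "fail"),
    "t3": ((1, 0, 1), "fail"),
    "t4": ((0, 1, 0), "pass"),
}
print(flag_coincidental(suite))  # ['t1']
```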
- Research Article
2
- 10.3390/sym14061267
- Jun 19, 2022
- Symmetry
An important research aspect of Spectrum-Based Fault Localization (SBFL) is the set of factors that influence the effectiveness of suspiciousness formulas, viewed from the perspective of symmetry. Coincidental correctness is one of the most important factors impacting the effectiveness of suspiciousness formulas. Its influence on fault localization has attracted a large amount of empirical research; however, such studies can hardly be considered sufficiently comprehensive given the large number of symmetrical suspiciousness formulas. Therefore, we first develop an innovative theoretical framework based on function derivation to investigate how suspiciousness formulas are impacted by coincidental correctness. We define three types of relations between formulas affected by coincidental correctness, namely, the improved type, the invariant type and the uncertain type. We investigated 30 suspiciousness formulas using this framework and grouped them into three categories. Furthermore, we conducted an empirical study to verify the effectiveness of SBFL affected by coincidental correctness on four relatively large C programs. We proved that coincidental correctness has a positive effect on 23 of these 30 formulas and no effect on 5 of them, while the effect on the remaining 2 depends on certain conditions. The experimental results show that the effectiveness of some suspiciousness formulas can be enhanced, while that of others remains unchanged.
- Conference Article
98
- 10.1109/icse43902.2021.00067
- May 1, 2021
In this paper, we propose DeepRL4FL, a deep learning fault localization (FL) approach that locates buggy code at the statement and method levels by treating FL as an image pattern recognition problem. DeepRL4FL does so via novel code coverage representation learning (RL) and data dependency RL for program statements. These two types of RL on the dynamic information in a code coverage matrix are combined with code representation learning on the static information of the usual suspicious source code. This combination is inspired by crime scene investigation, in which investigators analyze the crime scene (failed test cases and statements) and related persons (statements with dependencies), and at the same time examine the usual suspects who have committed a similar crime in the past (similar buggy code in the training data). For the code coverage information, DeepRL4FL first orders the test cases and marks error-exhibiting code statements, expecting that a model can recognize the patterns discriminating between faulty and non-faulty statements/methods. For dependencies among statements, the suspiciousness of a statement is assessed taking into account its data dependencies on other statements in execution and data flows, in addition to the statement itself. Finally, the vector representations of the code coverage matrix, the data dependencies among statements, and the source code are combined and used as the input of a classifier built from a Convolutional Neural Network to detect buggy statements/methods. Our empirical evaluation shows that DeepRL4FL improves the top-1 results over the state-of-the-art statement-level FL baselines by 173.1% to 491.7%. It also improves the top-1 results over the existing method-level FL baselines by 15.0% to 206.3%.
- Conference Article
118
- 10.1109/icse.2009.5070507
- Jan 1, 2009
Recent techniques for fault localization leverage code coverage to address the high cost of debugging. These techniques exploit the correlations between program failures and the coverage of program entities as the clue in locating faults. Experimental evidence shows that the effectiveness of these techniques can be affected adversely by coincidental correctness, which occurs when a fault is executed but no failure is detected. In this paper, we propose an approach to address this problem. We refine the code coverage of test runs using control- and data-flow patterns prescribed by different fault types. We conjecture that this extra information, which we call context patterns, can strengthen the correlations between program failures and the coverage of faulty program entities, making it easier for fault localization techniques to locate the faults. To evaluate the proposed approach, we have conducted a mutation analysis on three real-world programs and cross-validated the results with real faults. The experimental results consistently show that coverage refinement is effective in easing the coincidental correctness problem in fault localization techniques.
- Conference Article
24
- 10.1109/compsac.2014.32
- Jul 1, 2014
Although empirical studies have demonstrated the usefulness of statistical fault localization based on code coverage, the effectiveness of these techniques may deteriorate under undesirable circumstances such as coincidental correctness, where one or more passing test cases exercise a faulty statement, making it difficult to decide whether the exercised statement is faulty or not. Coverage-based fault localization can be improved if all instances of coincidental correctness are identified and proper strategies are employed to deal with these troublesome test cases. We introduce a technique to effectively identify coincidentally correct test cases. The proposed technique combines support vector machines and ensemble learning to detect mislabeled test cases, i.e., coincidentally correct test cases. The ensemble-based support vector machine can then be used to trim a test suite or to flip the status of the coincidentally correct test cases, thus improving the effectiveness of fault localization.
- Conference Article
18
- 10.1109/icst.2012.130
- Apr 1, 2012
Coincidentally correct test cases are those that execute faulty statements but do not cause failures. Such test cases reduce the effectiveness of spectrum-based fault localization techniques, such as Ochiai. These techniques calculate a suspiciousness score for each statement. The suspiciousness score estimates the likelihood that the program will fail if the statement is executed. The presence of coincidentally correct test cases reduces the suspiciousness score of the faulty statement, thereby reducing the effectiveness of fault localization. We present two approaches that predict coincidentally correct test cases and use the predictions to improve the effectiveness of spectrum-based fault localization. In the first approach, we assign weights to passing test cases such that the test cases that are likely to be coincidentally correct obtain low weights. We then use the weights to calculate suspiciousness scores. In the second approach, we iteratively predict and remove coincidentally correct test cases, and calculate the suspiciousness scores with the reduced test suite. In this dissertation, we investigate the cost and effectiveness of our approach to predicting coincidentally correct test cases and utilizing the predictions. We report the results of our preliminary evaluation of effectiveness and outline our research plan.
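The first approach above (down-weighting likely CC passing tests) can be sketched with a weighted variant of the Ochiai formula. This is a minimal sketch under assumed data structures, not the authors' implementation; all names and the example weights are hypothetical:

```python
import math

def weighted_ochiai(covering, weights, verdicts):
    """Ochiai suspiciousness for one statement, with weighted passes.

    ef = number of failed tests covering the statement,
    tf = total number of failed tests,
    ep = weighted sum over passing tests covering the statement.
    Ochiai = ef / sqrt(tf * (ef + ep)).  Down-weighting likely
    coincidentally correct passing tests shrinks ep, which raises
    the faulty statement's score.  `covering[t]` is True if test t
    executes the statement.
    """
    ef = sum(1 for t in verdicts if verdicts[t] == "fail" and covering[t])
    tf = sum(1 for t in verdicts if verdicts[t] == "fail")
    ep = sum(weights[t] for t in verdicts
             if verdicts[t] == "pass" and covering[t])
    denom = math.sqrt(tf * (ef + ep))
    return ef / denom if denom else 0.0

verdicts = {"t1": "fail", "t2": "pass", "t3": "pass"}
covering = {"t1": True, "t2": True, "t3": False}
# t2 covers the faulty statement yet passes (a likely CC test);
# weighting it at 0.1 instead of 1.0 raises the suspiciousness:
print(weighted_ochiai(covering, {"t1": 1, "t2": 1.0, "t3": 1}, verdicts))
print(weighted_ochiai(covering, {"t1": 1, "t2": 0.1, "t3": 1}, verdicts))
```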
- Conference Article
4
- 10.1109/issrew.2013.6688889
- Nov 1, 2013
In software debugging, statistical fault localization techniques contrast the dynamic spectra of program elements to estimate the location of faults in faulty programs. Coincidental correctness may have a negative impact on these techniques because faults can also be triggered in an observed non-failed run, thus disturbing the assessment of fault locations. However, eliminating this confounding effect relies on accurately recognizing such runs. This paper uses the presence of coincidental correctness as an effective indicator for successful fault localization. We calculate the distributional overlap of the dynamic spectra in failed runs and non-failed runs to find the fault-leading predicates, and further narrow the region by referencing the inter-class distances of the spectra to suppress less suspicious candidates. Empirical results show that our technique can outperform representative existing predicate-based fault localization techniques.
- Conference Article
14
- 10.1109/iccis.2012.361
- Aug 1, 2012
In order to improve the efficiency of debugging, many fault localization techniques have been proposed to find the program entities that are likely to contain faults. However, recent research indicates that the effectiveness of fault localization techniques suffers from occurrences of coincidental correctness, in which test cases that exercise faulty statements reveal no failure. This paper presents a strategy using cluster analysis to identify coincidental correctness in test sets for fault localization. Test cases that exercise the same faulty statements are expected to be grouped together by cluster analysis, and the tests identified as coincidentally correct can then be used during debugging to improve the effectiveness of fault localization techniques. To evaluate our technique, we conducted an experiment on several Siemens suite programs. The experimental results show that the strategy is effective at automatically identifying coincidentally correct tests.
- Conference Article
3
- 10.1109/icst46399.2020.00013
- Aug 6, 2020
Researchers have used execution profiles to enable coverage-based techniques in areas such as defect detection and fault localization. Typical profile elements include functions, statements, and branches, which are structural in nature. Such elements might not always discriminate failing runs from passing runs, which renders them ineffective in some cases. This motivated us to investigate alternative profiles, namely, substate profiles that aim at approximating the state of a program (as opposed to its execution path). Substate profiling is a recently presented form of state profiling that is practical, fine-grained, and generic enough to be applicable to various profile-based analyses. This paper presents an empirical study demonstrating how complementing structural profiles with substate profiles would benefit Test Suite Reduction (TSR), Test Case Prioritization (TCP), and Spectrum-based Fault Localization (SBFL). Using the Defects4J benchmark, we contrasted the effectiveness of TSR, TCP, and SBFL when using the structural profiles only to when using the concatenation of the structural and substate profiles. Leveraging substate profiling enhanced the effectiveness of all three techniques. For example: 1) For TSR, 86 more versions exhibited a 100% defect detection rate. 2) For TCP, 22 more versions had one of their failing tests ranked among the top 20%. 3) For SBFL, substate profiling localized 14 faults that structural profiling failed to localize. Furthermore, our study showed that the improvement due to substate profiling was noticeably more significant in the presence of coincidentally correct tests than in their absence. This positions substate profiling as a promising basis for mitigating the negative effect of coincidental correctness.
- Conference Article
64
- 10.1109/icst.2010.22
- Jan 1, 2010
Researchers have argued that for failure to be observed the following three conditions must be met: 1) the defect is executed, 2) the program has transitioned into an infectious state, and 3) the infection has propagated to the output. Coincidental correctness arises when the program produces the correct output, while conditions 1) and 2) are met but not 3). In previous work, we showed that coincidental correctness is prevalent and demonstrated that it is a safety reducing factor for coverage-based fault localization. This work aims at cleansing test suites from coincidental correctness to enhance fault localization. Specifically, given a test suite in which each test has been classified as failing or passing, we present three variations of a technique that identify the subset of passing tests that are likely to be coincidentally correct. We evaluated the effectiveness of our techniques by empirically quantifying the following: 1) how accurately did they identify the coincidentally correct tests, 2) how much did they improve the effectiveness of coverage-based fault localization, and 3) how much did coverage decrease as a result of applying them. Using our better performing technique and configuration, the safety and precision of fault-localization was improved for 88% and 61% of the programs, respectively.
- Research Article
87
- 10.1145/2559932
- Feb 1, 2014
- ACM Transactions on Software Engineering and Methodology
Researchers have argued that for failure to be observed the following three conditions must be met: C_R = the defect was reached; C_I = the program has transitioned into an infectious state; and C_P = the infection has propagated to the output. Coincidental Correctness (CC) arises when the program produces the correct output while condition C_R is met but not C_P. We recognize two forms of coincidental correctness, weak and strong. In weak CC, C_R is met, whereas C_I might or might not be met; in strong CC, both C_R and C_I are met. In this work we first show that CC is prevalent in both of its forms and demonstrate that it is a safety-reducing factor for Coverage-Based Fault Localization (CBFL). We then propose two techniques for cleansing test suites from coincidental correctness to enhance CBFL, given that the test cases have already been classified as failing or passing. We evaluated the effectiveness of our techniques by empirically quantifying their accuracy in identifying weak CC tests. The results were promising; for example, the better performing technique, using 105 test suites and statement coverage, exhibited 9% false negatives and 30% false positives, with no false negatives or false positives in 14.3% of the test suites. Using 73 test suites and more complex coverage, the numbers were 12%, 19%, and 15%, respectively.