Combining Coverage and Expert Features with Semantic Representation for Coincidental Correctness Detection

Abstract

Coincidental correctness (CC) can be misleading for developers because it gives the impression that the code is functioning correctly when there are hidden faults. To mitigate the negative impact of CC test cases, extensive research has been conducted on their detection, employing either coverage-based or expert-based features, and these studies have yielded promising results. Coverage and expert features each provide unique insights into program execution, yet the literature has not fully explored the combined potential of these two feature sets to enhance the detection of CC. Additionally, the rich semantics of the test code and focal method have not been fully exploited. Therefore, we propose a unified model, CORE, that integrates coverage and expert features with semantic representations of test and focal methods to improve the detection of CC test cases. We conduct a comprehensive evaluation against six state-of-the-art baselines on the widely-used Defects4J benchmark. The experimental results show that CORE outperforms the baselines in CC detection accuracy by a substantial margin (i.e., a 40% improvement on average in F1 score). An ablation study further shows that the coverage, expert, and semantic features each contribute to CORE's performance. CORE can also improve the effectiveness of spectrum-based and mutation-based fault localization (e.g., 50% improvement for the spectrum-based formula Dstar and 44% for the mutation-based method MUSE under the relabeling strategy).
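To illustrate why relabeling detected CC tests helps spectrum-based fault localization, the sketch below computes the Dstar suspiciousness score (with the common exponent of 2) for a buggy statement before and after relabeling. The test counts are hypothetical, chosen only to show the effect; the formula itself is the standard Dstar definition.

```python
# Dstar (exponent 2) scores a statement s as
#   failed(s)**2 / (passed(s) + (total_failed - failed(s)))
# Relabeling CC tests as failing raises failed(s) and lowers passed(s),
# boosting the buggy statement's suspiciousness.

def dstar(failed_cover, passed_cover, total_failed, star=2):
    """Dstar suspiciousness for one statement."""
    denom = passed_cover + (total_failed - failed_cover)
    return float("inf") if denom == 0 else failed_cover ** star / denom

# Hypothetical buggy statement: covered by 4 failing and 6 passing tests,
# with 5 failing tests in the whole suite.
before = dstar(failed_cover=4, passed_cover=6, total_failed=5)

# Suppose a CC detector relabels 3 of those passing tests as failing:
after = dstar(failed_cover=7, passed_cover=3, total_failed=8)

assert after > before  # relabeling CC tests raises the suspiciousness score
```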

Similar Papers
  • Conference Article
  • Cited by 5
  • 10.1109/saner56733.2023.00018
NeuralCCD: Integrating Multiple Features for Neural Coincidental Correctness Detection
  • Mar 1, 2023
  • Zhou Tao + 3 more

Fault localization seeks to locate the suspicious statements possibly responsible for a program failure. Experimental evidence shows that fault localization effectiveness is adversely affected by the existence of coincidental correctness (CC) test cases, where a CC test case denotes a test case that executes a fault but produces no failure. Even worse, CC test cases are prevalent in realistic testing and debugging, posing a severe threat to fault localization effectiveness. Thus, it is indispensable to accurately detect CC test cases and alleviate their harmful effect on fault localization. To address this problem, we propose NeuralCCD: a neural coincidental correctness detection approach that integrates multiple features. Specifically, NeuralCCD first leverages suspiciousness score, coverage ratio, and similarity to define three CC detection features. Based on these features and CC labels, NeuralCCD utilizes a multi-layer perceptron to learn a feature-based model for each program, and finally combines the trained models of different programs into an ensemble system to detect CC test cases. To evaluate the effectiveness of NeuralCCD, we conduct large-scale experiments on 247 faulty versions of five representative benchmarks and compare NeuralCCD with four state-of-the-art CC detection approaches. The experimental results show that NeuralCCD significantly improves the effectiveness of CC detection, e.g., NeuralCCD yields up to 109.5%, 93%, and 81.3% improvement in Top-1, Top-3, and Top-5 over Tech-I when utilized with the Dstar formula.

  • Conference Article
  • 10.1109/ase63991.2025.00127
Sifting Truth from Coincidences: A Two-Stage Positive and Unlabeled Learning Model for Coincidental Correctness Detection
  • Nov 16, 2025
  • Chunyan Liu + 4 more

Fault localization (FL) can identify a fault's location by analyzing the execution information from a program's test cases. This execution information serves as the foundation for FL to infer latent causal relationships between fault entities and failing results. However, this execution information contains coincidental correctness (CC), which reduces the accuracy of FL. CC arises when a test case executes faulty program entities but still produces the correct output, leading to misleading FL inferences. In widely used datasets, the presence of CC compromises the reliability of passed test cases (i.e., negative labels). In contrast, failed test cases (i.e., positive labels) remain definitive. In FL scenarios, unlabeled data is typically abundant and primarily consists of passed test cases. Therefore, systematically leveraging positive and unlabeled data for accurate CC detection is essential and beneficial to FL. To tackle this problem, we propose a two-stagE positiVe and unlAbeled learning model for coiNcidental correctneSs detection, EVANS. EVANS defines failed test cases as positive samples and treats the remaining ones as unlabeled data. It comprises two core modules: (1) A module for selecting high-quality pseudo-negative samples. This module leverages vector distance metrics to identify high-quality pseudo-negative test cases, using inter-class distances computed via a pre-trained model. (2) A weakly supervised contrastive learning module. This module utilizes the labeled samples from Stage (1) to train a contrastive learning model, which then detects CC in unlabeled test cases. Experimental results demonstrate that EVANS significantly outperforms current CC detection methods.
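The pseudo-negative selection stage described above can be sketched as a distance ranking: unlabeled tests farthest (in some embedding space) from the failing-test cluster are the most plausible genuine passes. The embeddings, the centroid-based distance, and the cutoff `k` below are all assumptions for illustration, not the paper's actual model.

```python
# Sketch: pick pseudo-negatives as the k unlabeled tests farthest from the
# centroid of the positive (failing) test embeddings. All values are toy data.
import math

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_pseudo_negatives(positive_embs, unlabeled_embs, k):
    """Indices of the k unlabeled tests farthest from the positive centroid."""
    c = centroid(positive_embs)
    ranked = sorted(range(len(unlabeled_embs)),
                    key=lambda i: euclid(unlabeled_embs[i], c),
                    reverse=True)
    return ranked[:k]

# Failing tests cluster near (1, 1); unlabeled tests near that cluster are CC
# suspects, while distant ones are likely genuine passes.
pos = [[1.0, 1.0], [1.1, 0.9]]
unl = [[1.0, 1.05], [5.0, 5.0], [4.8, 5.2]]
chosen = select_pseudo_negatives(pos, unl, k=2)  # indices 1 and 2
```

The selected pseudo-negatives, together with the definitive positives, would then supply labels for the contrastive stage.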

  • Research Article
  • 10.1109/tr.2026.3668421
Enhanced Feature Representation via Hybrid Feature Fusion for Coincidental Correctness Detection
  • Jan 1, 2026
  • IEEE Transactions on Reliability
  • Tao Zhang + 5 more


  • Research Article
  • Cited by 4
  • 10.1109/tse.2024.3481893
Towards More Precise Coincidental Correctness Detection With Deep Semantic Learning
  • Dec 1, 2024
  • IEEE Transactions on Software Engineering
  • Huan Xie + 6 more

Coincidental correctness (CC) is a situation in which, during the execution of a test case, the buggy entity is executed but the program behaves correctly as expected. Many automated fault localization (FL) techniques use runtime information to discover the underlying connection between the executed buggy entity and the failing test result. The existence of CC weakens such connections, misleads FL algorithms into building inaccurate models, and consequently decreases localization accuracy. To alleviate the adverse effect of CC on FL, CC detection techniques have been proposed to identify possible CC tests via heuristic or machine learning algorithms. However, their precision is not satisfactory, since they overestimate the possible CC tests and are insufficient in learning deep semantic features. In this work, we propose a novel Triplet network-based Coincidental Correctness detection technique (i.e., TriCoCo) to overcome the limitations of prior work. TriCoCo narrows the possible CC tests by designing three features to identify genuine passing tests. Instead of using all tests as inputs, as existing techniques do, TriCoCo takes the identified genuine passing tests and failing ones to train a triplet model that can evaluate their relative distance. Finally, TriCoCo infers the probability that each remaining passing test is a CC test using the trained triplet model. We conduct large-scale experiments to evaluate TriCoCo on the widely-used Defects4J benchmark. The results demonstrate that TriCoCo improves not only the precision of CC detection but also the effectiveness of FL techniques, e.g., the precision of TriCoCo is 80.33% on average, and TriCoCo boosts the efficacy of DStar by 18%–74% in terms of the MFR metric when compared to seven state-of-the-art CC detection baselines.
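The triplet model mentioned above is trained with the standard triplet objective: an anchor embedding is pulled toward a same-class positive and pushed away from an other-class negative by at least a margin. The sketch below shows the loss in plain Python on toy 2-D embeddings; the margin value and the vectors are assumptions.

```python
# Standard triplet loss on toy embeddings: max(0, d(a,p) - d(a,n) + margin).

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the anchor is closer to the positive by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

# Anchor near the positive and far from the negative -> zero loss:
loss_good = triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])
# Anchor closer to the negative than the positive -> positive loss:
loss_bad = triplet_loss([0.0, 0.0], [2.0, 0.0], [0.1, 0.0])
```

After training, a passing test whose embedding sits closer to the failing cluster than to the genuine-passing cluster would receive a high CC probability.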

  • Conference Article
  • 10.1109/ase63991.2025.00041
From Sparse to Structured: A Diffusion-Enhanced and Feature-Aligned Framework for Coincidental Correctness Detection
  • Nov 16, 2025
  • Huan Xie + 4 more

Coincidental correctness (CC) refers to test cases that execute faulty code but still produce expected outputs. This phenomenon introduces noise into the data of software testing-related tasks. As demonstrated in the literature, CC has a negative impact on test suite reduction, test case prioritization, fault localization, and automated program repair. Thus, it is essential to detect and mitigate the impact of CC. Although CC is commonly observed across a large number of programs, CC test cases are typically sparse within each program's test suite. In other words, CC test cases generally make up only a small portion of the passing test cases; the proportions vary from 3.27% to 31.74% within Defects4J V1.4. This results in a highly imbalanced distribution of CC versus non-CC test cases, posing challenges for accurate detection. To address this issue, we propose a Diffusion-Enhanced and Feature-Aligned Framework for Coincidental Correctness detection, named DEFACC, to obtain more structured representations of test cases. Specifically, DEFACC first introduces a diffusion-based generation module. This module generates new CC samples from original samples to alleviate the class imbalance issue and enhance the diversity of CC samples. However, generated feature samples may deviate from the distribution of real CC samples, and such shifts can hurt model reliability and generalization. To resolve this, DEFACC integrates a feature alignment module founded on the Maximum Mean Discrepancy (MMD) loss. This module enforces distributional consistency between generated and original CC samples during training. Together, these components move the augmented samples from sparse to structured: not only quantitatively balanced but also semantically faithful. Experimental results show that DEFACC significantly improves the performance of existing CC detection methods and provides a stronger representation foundation for accurate fault localization.
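The MMD-based alignment idea can be sketched with the biased empirical squared MMD under an RBF kernel: it is small when generated samples match the real distribution and grows when they drift. The kernel bandwidth and the toy samples below are assumptions for illustration, not DEFACC's actual configuration.

```python
# Biased empirical MMD^2 with an RBF kernel between two sample sets.
import math

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def mmd2(X, Y, gamma=1.0):
    """MMD^2 = mean k(X,X) + mean k(Y,Y) - 2 * mean k(X,Y)."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy

real = [[0.0, 0.0], [0.1, 0.1]]       # toy "real CC" feature samples
aligned = [[0.05, 0.0], [0.0, 0.1]]   # generated samples near the real ones
shifted = [[2.0, 2.0], [2.1, 1.9]]    # generated samples that drifted away

assert mmd2(real, aligned) < mmd2(real, shifted)
```

Minimizing such a term during training penalizes generated CC samples whose feature distribution drifts from the real one.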
