Coincidental correctness in the Defects4J benchmark

Rawad Abou Assi, Chadi Trad, Wes Masri, Marwan Maalouf

doi:10.1002/stvr.1696

Abstract

Coincidental correctness (CC) arises when a defective program produces the correct output despite the fact that the defect within was exercised. Researchers have recognized the negative impact of CC, and the authors have previously conducted a study demonstrating its prevalence in test suites. However, that study was limited to system tests and to small subjects seeded with artificial defects. In this paper, we conduct a wider-scope study of CC that addresses the following research questions in the context of the Defects4J benchmark. RQ1: Is CC prevalent in Defects4J? RQ2: Is CC affected by the testing levels in Defects4J? RQ3: Do CC tests induce peculiar infection paths in Defects4J? Furthermore, we use JTidy and NanoXML to address the following question. RQ4: Are the infections likely to be nullified within or outside the buggy method? To answer RQ1, we manually injected two code checkers for each of the 395 Defects4J defects: (i) a weak checker that detects weak CC tests by monitoring whether the defect was reached; and (ii) a strong checker that detects strong CC tests by monitoring whether the defect was reached and the program has transitioned into an infectious state. Our results showed that CC is prevalent in Defects4J, as we observed 38.1× more strong CC tests than failing tests and 60.5× more weak CC tests than failing tests. Testing has traditionally been classified into several levels that include unit, module, integration, system, and acceptance. Meanwhile, the test cases in Defects4J are not classified into any of the aforementioned testing levels. In addition, the boundaries between such levels are not clear because of the lack of a clear universal definition. Therefore, in order to answer RQ2, we derive the testing level of a test case from its method coverage information; specifically, we base it on the number and frequency of execution of the methods it covers.
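The weak and strong checkers described above can be pictured with a minimal sketch. It is written in Python for brevity, although the Defects4J subjects are Java programs, and it is not the authors' instrumentation. The toy subject, the `Probe` class, and the function names are all hypothetical. The sketch assumes the fixed statement is `scaled = x * 2`, the seeded defect is `scaled = x + 2`, and that the fixed program serves as the test oracle.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Flags set by the two injected checkers during one test run (hypothetical)."""
    reached: bool = False   # weak checker: the defective statement was executed
    infected: bool = False  # strong checker: the state differs from the fixed program


def fixed_is_large(x: int) -> bool:
    """Reference (fixed) version of the toy subject."""
    if x < 0:
        return False
    return x * 2 > 5


def buggy_is_large(x: int, probe: Probe) -> bool:
    """Defective version with both checkers injected around the defect."""
    if x < 0:
        return False                # the defect is never reached on this path
    probe.reached = True            # weak checker
    scaled = x + 2                  # defective statement (fixed: x * 2)
    if scaled != x * 2:             # strong checker: compare with the fixed value
        probe.infected = True
    return scaled > 5               # the conditional may nullify the infection


def classify(x: int) -> str:
    """Label one test input by its outcome and by what the checkers observed."""
    probe = Probe()
    actual = buggy_is_large(x, probe)
    if actual != fixed_is_large(x):
        return "failing"
    if probe.infected:
        return "strong CC"          # passed despite an infectious state
    if probe.reached:
        return "weak CC"            # passed, defect reached, no infection seen
    return "passing, defect not reached"
```

In this sketch every strong CC test also satisfies the weak condition, since an infection can only arise after the defect is reached; `classify` simply reports the stronger label when both apply. For example, `classify(5)` yields `"strong CC"` because 7 and 10 differ but both exceed 5, while `classify(3)` yields `"failing"`.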
Our results showed that CC is present at all testing levels, but is more prevalent in high testing levels than in low testing levels. To answer RQ3, we contrasted the characteristics of the infection propagation paths induced by the Defects4J failing tests to those induced by the strong CC tests. We observed that the paths induced by the CC tests (i) were considerably longer on average and (ii) comprised a higher number of conditional, modulo, multiplication, division, and invocation statements. Finally, to answer RQ4, which relates to RQ2, we performed an experiment involving JTidy, NanoXML, and their associated high-level test suites. We used code checkers to determine whether, in the case of strong CC, the infections were nullified before exiting the buggy function or afterward. All of our observations showed that the infections were nullified after exiting the buggy function. © 2019 John Wiley & Sons, Ltd.
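The RQ4 distinction between an infection nullified inside the buggy function and one nullified after it returns can be illustrated with a small sketch. It is a hypothetical toy in Python, not the checkers used on JTidy or NanoXML. It assumes the fixed computation is `x * 2`, the defect is `x + 2`, the buggy function clamps its result to 10, and a caller reduces the result to a label with a conditional.

```python
def fixed_clamp(x: int) -> int:
    """Reference (fixed) version of the hypothetical buggy function."""
    return min(x * 2, 10)


def buggy_clamp(x: int) -> int:
    """Defective version: uses `+` where the fixed code uses `*`."""
    return min(x + 2, 10)


def label(value: int) -> str:
    """Caller-side code that runs after the buggy function has returned."""
    return "big" if value > 5 else "small"


def nullification_site(x: int) -> str:
    """Report where, if anywhere, the infection caused by the defect disappears."""
    # Check right after the defective statement: did the state become infected?
    if x + 2 == x * 2:
        return "no infection"
    # Check at function exit: does the returned value still differ?
    if buggy_clamp(x) == fixed_clamp(x):
        return "nullified inside the buggy function"
    # Check at the output: does the difference survive the caller's conditional?
    if label(buggy_clamp(x)) == label(fixed_clamp(x)):
        return "nullified outside the buggy function"
    return "propagated to the output"
```

With these assumptions, `nullification_site(9)` reports nullification inside the function (both versions clamp to 10), `nullification_site(5)` reports nullification outside it (7 and 10 both map to `"big"`), and `nullification_site(3)` reports a failure. The study's observations all fell in the second category.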

Journal: Software Testing, Verification and Reliability
Publication Date: Mar 25, 2019
Citations: 20

