Abstract

White-box test generation analyzes the code of the system under test, selects relevant test inputs, and captures the observed behavior of the system as expected values in the tests. However, if there is a fault in the implementation, this fault can become encoded in the assertions (expectations) of the tests. The fault is only recognized if the developer using test generation is also aware of the real expected behavior; otherwise, the fault remains silent both in the test and in the implementation. A common assumption is that developers using white-box test generation techniques need to inspect the generated tests and their assertions, and to validate whether the tests encode a fault or represent the real expected behavior. Our goal is to provide insights into how well developers perform in this classification task. We designed an exploratory study to investigate the performance of developers and conducted an internal replication to increase the validity of the results. The two studies were carried out in a laboratory setting with 106 graduate students altogether. The tests were generated for four open-source projects. The results were analyzed quantitatively (binary classification metrics and timing measurements) and qualitatively (by observing and coding the activities of participants from screen captures and detailed logs). The results showed that participants tend to misclassify tests encoding both expected and faulty behavior (with a median misclassification rate of 20%). The time required to classify one test varied broadly, with an average of 2 minutes. This classification task is an essential step in white-box test generation that notably affects the real fault detection capability of such tools. We recommend a conceptual framework to describe the classification task and suggest taking this problem into account when using or evaluating white-box test generators.

Highlights

  • Due to the ever-increasing importance of software, assessment of its quality is essential

  • The results show that deciding whether a test encodes expected behavior was a challenging task for the participants even in a laboratory setting with artificially prepared environments

  • We found that the Matthews correlation coefficient (defined below) was only around 0.4 to 0.55 for most of the participants
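
For context, the Matthews correlation coefficient (MCC) summarizes the confusion matrix of a binary classification (here, labeling each generated test as encoding expected or faulty behavior) in a single value between −1 and +1, where 1 is a perfect classification and 0 is no better than chance:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$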

Introduction

Due to the ever-increasing importance of software, assessment of its quality is essential. Software testing is one of the most frequently used techniques to assess and improve software quality. To alleviate the tasks of developers, several automated test generation techniques have been proposed (Anand et al. 2013). These advanced techniques are often available as off-the-shelf tools, e.g., Pex/IntelliTest (Tillmann and de Halleux 2008), Randoop (Pacheco et al. 2007), or EvoSuite (Fraser and Arcuri 2013). These tools can rely solely on the source or binary code to select relevant test inputs. If the implementation alone is used as the input for test generation, the assertions created in the generated code capture the observed behavior, not the expected one.
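
To illustrate the issue, the sketch below shows a JUnit-style test of the kind a white-box generator might emit for a hypothetical PriceCalculator class. The class, method, and values are invented for illustration and are not taken from the study or from the output of any particular tool. The implementation is intended to cap discounts at 50% but omits the cap, and the generated assertion records the faulty observed result, so the test passes:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical implementation under test: the discount is intended to be capped at 50%,
// but the cap is missing, so discountedPrice(100.0, 70) returns 30.0 instead of the
// intended 50.0.
class PriceCalculator {
    static double discountedPrice(double price, int discountPercent) {
        // Fault: the intended cap "if (discountPercent > 50) discountPercent = 50;" is missing.
        return price - price * discountPercent / 100.0;
    }
}

// A test of the kind a white-box generator might emit: the input is chosen from the code,
// and the expected value is captured from the observed output of the implementation.
public class PriceCalculatorGeneratedTest {
    @Test
    public void discountedPriceEncodesObservedBehavior() {
        double result = PriceCalculator.discountedPrice(100.0, 70);
        // The oracle records whatever the implementation returned, so the fault is baked
        // into the assertion and the test passes. Only a developer who knows the intended
        // behavior (50.0) can classify this test as one that encodes a fault.
        assertEquals(30.0, result, 0.0001);
    }
}
```

In such a case the generated test is green, the fault is silent, and the classification task described above (deciding whether the assertion reflects the real expected behavior) falls entirely on the developer.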
