Prioritizing Test Inputs for DNNs Using Training Dynamics
Deep Neural Network (DNN) testing is one of the most widely-used techniques to guarantee the quality of DNNs. However, DNN testing typically requires the ground truth of test inputs, which is time-consuming and labor-intensive to obtain. To relieve the labeling-cost problem of DNN testing, we propose TDPR, a test input prioritization technique for DNNs based on training dynamics. The key insight of TDPR is that bug-revealing samples exhibit different learning trajectories compared to normal ones. Based on this, TDPR constructs a learning trajectory for each test input, which characterizes the evolving learning behavior of DNNs. Then, TDPR extracts features from these learning trajectories and applies learning-to-rank techniques to build a ranking model, which can intelligently utilize the generated features to prioritize test inputs. To evaluate TDPR, we conduct extensive experiments on 8 diverse subjects, considering various domains of test inputs, different DNN architectures, and diverse types of test inputs. The evaluation results demonstrate that TDPR outperforms 7 baseline approaches in both prioritizing test inputs and guiding the retraining of DNNs.
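The learning-trajectory idea above can be sketched in a few lines of Python; the per-epoch prediction format and the two features below are illustrative stand-ins, not TDPR's actual feature set or ranking model:

```python
# Sketch: summarize one test input's learning trajectory across training
# epochs. Illustrative only -- TDPR's real features and ranker differ.

def trajectory_features(epoch_predictions, final_label):
    """epoch_predictions: predicted class at each epoch; final_label: the
    fully trained model's prediction for this input."""
    flips = sum(1 for a, b in zip(epoch_predictions, epoch_predictions[1:])
                if a != b)
    agreement = (sum(1 for p in epoch_predictions if p == final_label)
                 / len(epoch_predictions))
    return {"flips": flips, "final_agreement": agreement}

# An input whose prediction keeps changing during training is a natural
# candidate for a bug-revealing (likely misclassified) input.
print(trajectory_features([3, 3, 3, 3, 3], final_label=3))  # stable input
print(trajectory_features([1, 4, 2, 4, 1], final_label=1))  # unstable input
```

A ranking model would then be trained on such per-input features over a labeled validation set and applied to prioritize unlabeled test inputs.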
- Research Article
18
- 10.1145/3607191
- Nov 24, 2023
- ACM Transactions on Software Engineering and Methodology
Graph Neural Networks (GNNs) have achieved promising performance in a variety of practical applications. Similar to traditional DNNs, GNNs can exhibit incorrect behavior that may lead to severe consequences, and thus testing is necessary and crucial. However, labeling all the test inputs for GNNs can be costly and time-consuming, especially when dealing with large and complex graphs, which seriously affects the efficiency of GNN testing. Existing studies have focused on test prioritization for DNNs, which aims to identify and prioritize fault-revealing tests (i.e., test inputs that are more likely to be misclassified) to detect system bugs earlier within a limited time. Although some DNN prioritization approaches have been demonstrated to be effective, a significant problem arises when applying them to GNNs: they do not take into account the connections (edges) between GNN test inputs (nodes), which play a significant role in GNN inference. In general, DNN test inputs are independent of each other, while GNN test inputs are usually represented as a graph with complex relationships between tests. In this article, we propose GraphPrior (GNN-oriented Test Prioritization), a set of approaches to prioritize test inputs specifically for GNNs via mutation analysis. Inspired by mutation testing in traditional software engineering, in which test suites are evaluated based on the mutants they kill, GraphPrior generates mutated models for GNNs and regards test inputs that kill many mutated models as more likely to be misclassified. GraphPrior then leverages the mutation results in two ways: killing-based and feature-based methods. When scoring a test input, the killing-based method considers each mutated model equally important, while feature-based methods learn a different importance for each mutated model through ranking models. Finally, GraphPrior ranks all the test inputs based on their scores. 
We conducted an extensive study based on 604 subjects to evaluate GraphPrior on both natural and adversarial test inputs. The results demonstrate that KMGP, the killing-based GraphPrior approach, outperforms the compared approaches in a majority of cases, with an average improvement of 4.76%~49.60% in terms of APFD. Furthermore, the feature-based GraphPrior approach, RFGP, performs the best among all the GraphPrior approaches. On adversarial test inputs, RFGP outperforms the compared approaches across different adversarial attacks, with an average improvement of 2.95%~46.69%.
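APFD (Average Percentage of Fault Detection), the metric behind these improvement figures, has a standard closed form; a minimal sketch for input prioritization, where each misclassified input counts as one fault detected at its own 1-based rank (this is the conventional formula, not GraphPrior code):

```python
# APFD = 1 - (sum of fault-detection positions) / (n * m) + 1 / (2n),
# where n = number of test inputs and m = number of faults (here, the
# misclassified inputs, each "detected" at its own position in the ranking).

def apfd(is_misclassified):
    """is_misclassified: booleans in ranked order, True = fault-revealing."""
    n = len(is_misclassified)
    positions = [i + 1 for i, bad in enumerate(is_misclassified) if bad]
    m = len(positions)
    return 1 - sum(positions) / (n * m) + 1 / (2 * n)

# A ranking that puts both fault-revealing inputs first scores higher
# than one that puts them last.
print(apfd([True, True, False, False]))   # 0.75
print(apfd([False, False, True, True]))   # 0.25
```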
- Research Article
93
- 10.1145/3394112
- Oct 4, 2020
- ACM Transactions on Software Engineering and Methodology
Deep neural networks (DNNs) have become increasingly popular, and DNN testing is critical to guarantee their correctness, i.e., the accuracy of the DNN in this work. However, DNN testing suffers from a serious efficiency problem: it is costly to label each test input to determine the DNN's accuracy on the testing set, since labeling is manual, may involve multiple persons (even those with domain-specific knowledge), and the testing set is large-scale. To relieve this problem, we propose a novel and practical approach, called PACE (short for Practical ACcuracy Estimation), which selects a small set of test inputs that can precisely estimate the accuracy of the whole testing set. In this way, labeling costs can be largely reduced by labeling only this small set of selected test inputs. Besides achieving a precise accuracy estimation, to make PACE more practical it is also required to be interpretable, deterministic, and as efficient as possible. Therefore, PACE first incorporates clustering to interpretably divide test inputs with different testing capabilities (i.e., testing different functionalities of a DNN model) into different groups. Then, PACE utilizes the MMD-critic algorithm, a state-of-the-art example-based explanation algorithm, to select prototypes (i.e., the most representative test inputs) from each group, according to the group sizes, which can reduce the impact of noise due to clustering. Meanwhile, PACE also borrows the idea of adaptive random testing to select test inputs from the minority space (i.e., the test inputs that are not clustered into any group) to achieve great diversity under the required number of test inputs. The two parallel selection processes (i.e., selection from both the groups and the minority space) compose the final small set of selected test inputs. 
We conducted an extensive study to evaluate the performance of PACE based on a comprehensive benchmark (i.e., 24 pairs of DNN models and testing sets) by considering different types of models (i.e., classification and regression models, high-accuracy and low-accuracy models, and CNN and RNN models) and different types of test inputs (i.e., original, mutated, and automatically generated test inputs). The results demonstrate that PACE is able to precisely estimate the accuracy of the whole testing set with only 1.181%∼2.302% deviations, on average, significantly outperforming the state-of-the-art approaches.
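The select-then-extrapolate step can be sketched as follows; random per-group picks stand in for PACE's MMD-critic prototype selection, and the function name and budget handling are illustrative:

```python
import random

# Sketch: label only a few representatives per cluster, proportionally to
# cluster size, and take their accuracy as the estimate for the whole
# testing set. (PACE picks prototypes with MMD-critic and also samples the
# minority space; random picks here are a deterministic-seed stand-in.)

def estimate_accuracy(groups, correct, budget, seed=0):
    """groups: clusters of test-input indices; correct: index -> bool (the
    labeling oracle, consulted only for selected inputs); budget: total
    number of inputs we can afford to label."""
    rng = random.Random(seed)
    total = sum(len(g) for g in groups)
    selected = []
    for g in groups:
        k = max(1, round(budget * len(g) / total))  # proportional to group size
        selected += rng.sample(g, min(k, len(g)))
    return sum(correct[i] for i in selected) / len(selected)

# If the model is right on every input, any selected subset estimates 1.0.
groups = [list(range(10)), list(range(10, 15))]
correct = {i: True for i in range(15)}
print(estimate_accuracy(groups, correct, budget=5))   # 1.0
```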
- Research Article
534
- 10.1038/339215a0
- May 1, 1989
- Nature
A brief, high-frequency activation of excitatory synapses in the hippocampus produces a long-lasting increase in synaptic strengths called long-term potentiation (LTP). A test input, which by itself does not have a long-lasting effect on synaptic strengths, can be potentiated through association when it is activated at the same time as a separate conditioning input. Neural network modelling studies have also predicted that synaptic strengths should be weakened when test and conditioning inputs are anti-correlated. Evidence for such heterosynaptic depression in the hippocampus has been found for inputs that are inactive or weakly active during the stimulation of a conditioning input, but this depression does not depend on any pattern of test input activity and does not seem to last as long as LTP. We report here an associative long-term depression (LTD) in field CA1 that is produced when a low-frequency test input is negatively correlated in time with a high-frequency conditioning input. LTD of synaptic strength is also produced by activating presynaptic terminals while a postsynaptic neuron is hyperpolarized. This confirms theoretical predictions that the mechanism for associative LTD is homosynaptic and follows a Hebbian covariance rule.
- Conference Article
121
- 10.1109/icse43902.2021.00046
- May 1, 2021
Deep Neural Network (DNN) testing is one of the most widely-used ways to guarantee the quality of DNNs. However, labeling test inputs to check the correctness of DNN predictions is very costly, which can largely affect the efficiency of DNN testing and even the whole process of DNN development. To relieve the labeling-cost problem, we propose a novel test input prioritization approach (called PRIMA) for DNNs via intelligent mutation analysis, in order to label more bug-revealing test inputs earlier within a limited time, which helps improve the efficiency of DNN testing. PRIMA is based on a key insight: a test input that kills many mutated models and produces different prediction results on many mutated inputs is more likely to reveal DNN bugs, and thus should be prioritized higher. After obtaining a number of mutation results from a series of our designed model and input mutation rules for each test input, PRIMA further incorporates learning-to-rank (a kind of supervised machine learning for solving ranking problems) to intelligently combine these mutation results for effective test input prioritization. We conducted an extensive study based on 36 popular subjects by carefully considering their diversity from five dimensions (i.e., different domains of test inputs, different DNN tasks, different network structures, different types of test inputs, and different training scenarios). Our experimental results demonstrate the effectiveness of PRIMA, which significantly outperforms the state-of-the-art approaches (with an average improvement of 8.50%~131.01% in terms of prioritization effectiveness). In particular, we have applied PRIMA to practical autonomous-vehicle testing at a large motor company, and the results on 4 real-world scene-recognition models in autonomous vehicles further confirm the practicability of PRIMA.
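The model-mutation half of this insight can be sketched with plain functions standing in for the original and mutated DNNs (illustrative only; PRIMA's mutation rules, input mutations, and learning-to-rank combination are more involved):

```python
# Sketch of the killing-based intuition: a test input that "kills" many
# mutated models (i.e., changes their prediction relative to the original
# model) is ranked as more likely to be bug-revealing. "Models" here are
# just functions from input -> predicted class.

def kill_score(x, original_model, mutated_models):
    base = original_model(x)
    return sum(1 for m in mutated_models if m(x) != base)

def prioritize(inputs, original_model, mutated_models):
    # Highest kill score first; ties keep input order (sorted is stable).
    return sorted(inputs,
                  key=lambda x: kill_score(x, original_model, mutated_models),
                  reverse=True)

original = lambda x: x % 3
mutants = [lambda x: (x + 1) % 3, lambda x: x % 3, lambda x: 0]
print(prioritize([0, 1, 2], original, mutants))   # [1, 2, 0]
```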
- Research Article
1
- 10.1145/3730435
- Jan 20, 2026
- ACM Transactions on Software Engineering and Methodology
The widespread adoption of Deep Neural Networks (DNNs) has brought remarkable advances in machine learning. However, the computational and memory demands of complex DNNs hinder their deployment in resource-constrained environments. To address this challenge, compressed DNN models have emerged, offering a compromise between efficiency and accuracy. Nonetheless, assessing the performance of these compressed models can demand extensive testing, typically requiring high manual labeling costs, rendering the process resource-intensive and time-consuming. To mitigate these challenges, test input prioritization has emerged as a promising technique aimed at reducing labeling costs by prioritizing inputs that are more likely to be misclassified. This enables the early identification of bug-revealing tests with reduced time and manual labeling effort. In this article, we propose PriCod, a novel test prioritization approach designed for compressed DNNs. PriCod leverages the behavior disparities caused by model compression, along with the embeddings of test inputs, to effectively prioritize potentially misclassified tests. It operates on the premises that significant behavior disparities between the models indicate potential misclassifications and that inputs near decision boundaries are more likely to be misclassified. To this end, PriCod generates two types of features for each test input (i.e., deviation features and embedding features) to capture the prediction deviation caused by model compression and the proximity to decision boundaries, respectively. By combining these features, PriCod predicts the probability of misclassification for each test, ranking tests accordingly. We conduct an extensive study to evaluate the effectiveness of PriCod, comparing it with multiple test prioritization approaches. 
The experimental results demonstrate the effectiveness of PriCod, with average improvements of 7.43%–55.89% on natural test inputs, 7.92%–52.91% on noisy test inputs, and 7.03%–51.59% on adversarial test inputs, compared with existing test prioritization approaches.
- Conference Article
7
- 10.1109/apsec.2015.34
- Dec 1, 2015
It is still a challenge to select good test inputs for concurrent programs within limited testing resources. In this paper, we present a test case diversity metric for multi-threaded programs, which evaluates a test input by its effectiveness in exposing concurrent thread interactions. We then propose an input-driven active testing approach with two test input selection strategies based on our test case diversity metric. We implement our testing approach on top of Maple, an interleaving coverage-driven active testing tool. The effectiveness and efficiency of our testing approach are compared closely with Maple itself, supplied with random test inputs. Experimental results show that our testing approach can outperform the original active testing approach in both the number of test inputs executed and the time needed to fulfill Maple's interleaving coverage criterion. The test inputs selected by our test case diversity metric are very cost-effective in exposing concurrent thread interactions and hence can help detect concurrency bugs with less cost and effort.
- Research Article
1
- 10.1002/smr.2550
- Mar 28, 2023
- Journal of Software: Evolution and Process
The safety and robustness of deep neural networks (DNNs) are currently of great concern. Adequate testing is commonly an effective technique to ensure software trustworthiness. However, existing DNN testing methods generate many invalid test inputs, which inevitably increases computational overhead and reduces the efficiency of DNN testing. In this paper, we focus on testing task-specific DNNs and investigate diverse, valid, and natural test input generation based on data augmentation techniques. Specifically, we propose AugTest, a DNN testing method based on stochastic optimization with momentum, which searches for optimal compositions of data augmentation parameters to efficiently generate diverse and valid test inputs. Experimental results show that our proposed method can effectively explore the data manifold space and find valid test inputs with high diversity and naturalness. Compared with the best-performing baseline, AugTest generates more test inputs with higher average diversity in less average time. Furthermore, the generated test inputs have competitive generalizability to DNNs with different structures. The test error rates exceed 70% when testing other DNN models performing similar tasks with the test inputs generated by AugTest. This implies that our method can produce more valid and generalized data to unveil DNNs' errors.
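The underlying search loop, stochastic optimization with momentum over a vector of augmentation parameters, can be sketched against a toy score function (all names and the score below are stand-ins, not AugTest's actual objective or parameterization):

```python
# Sketch: gradient-free momentum search over augmentation parameters
# (e.g., brightness, rotation), maximizing a diversity/validity score.
# The score function here is a toy stand-in peaked at (0.5, 0.5).

def momentum_search(score, params, steps=100, lr=0.1, beta=0.9, eps=1e-3):
    velocity = [0.0] * len(params)
    for _ in range(steps):
        for i in range(len(params)):
            # Finite-difference estimate of the score's gradient in coord i.
            up = params[:i] + [params[i] + eps] + params[i + 1:]
            g = (score(up) - score(params)) / eps
            velocity[i] = beta * velocity[i] + lr * g   # momentum update
            params[i] += velocity[i]                    # ascend the score
    return params

score = lambda p: -((p[0] - 0.5) ** 2 + (p[1] - 0.5) ** 2)
best = momentum_search(score, [0.0, 0.0])
print(best)   # approaches [0.5, 0.5]
```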
- Research Article
- 10.1609/aaai.v39i1.32033
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
An image encoder pre-trained by self-supervised learning can be used as a general-purpose feature extractor to build downstream classifiers for various downstream tasks. However, many studies showed that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built based on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify and recover a test input embedded with a trigger. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If not, the test input is processed in a normal way to maintain the utility. Otherwise, the test input will be further restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively identify the trojan (if any) from a given test input and recover it under state-of-the-art trojan attacks. We further demonstrate by experiments that our TrojanDec outperforms the state-of-the-art defenses.
- Research Article
4
- 10.1145/3688835
- Dec 30, 2024
- ACM Transactions on Software Engineering and Methodology
In recent years, significant progress has been made in testing methods for deep neural networks (DNNs) to ensure their correctness and robustness. Coverage-guided criteria, such as neuron-wise, layer-wise, and path-/trace-wise, have been proposed for DNN fuzzing. However, existing coverage-based criteria encounter performance bottlenecks for several reasons: ❶ Testing Adequacy: Partial neural coverage criteria have been observed to achieve full coverage using only a small number of test inputs. In this case, increasing the number of test inputs does not consistently improve the quality of models. ❷ Interpretability: The current coverage criteria lack interpretability. Consequently, testers are unable to identify and understand which incorrect attributes or patterns of the model are triggered by the test inputs. This lack of interpretability hampers the subsequent debugging and fixing process. Therefore, there is an urgent need for a novel fuzzing criterion that offers improved testing adequacy, better interpretability, and more effective failure detection capabilities for DNNs. To alleviate these limitations, we propose NSGen, an approach for DNN fuzzing that utilizes neuron semantics as guidance during test generation. NSGen identifies critical neurons, translates their high-level semantic features into natural language descriptions, and then assembles them into human-readable DNN decision paths (representing the internal decision of the DNN). With these decision paths, we can generate more fault-revealing test inputs by quantifying the similarity between original test inputs and mutated test inputs for fuzzing. We evaluate NSGen on popular DNN models (VGG16_BN, ResNet50, and MobileNet_v2) using CIFAR10, CIFAR100, Oxford 102 Flower, and ImageNet datasets. 
Compared to 12 existing coverage-guided fuzzing criteria, NSGen outperforms all baselines, increasing the number of triggered faults by 21.4% to 61.2% compared to the state-of-the-art coverage-guided fuzzing criterion. This demonstrates NSGen's effectiveness in generating fault-revealing test inputs through guided input mutation, highlighting its potential to enhance DNN testing and interpretability.
- Book Chapter
- 10.1007/3-540-60114-7_22
- Jan 1, 1995
In testing LSI circuits, it is sometimes important to generate sequences with strong randomness properties and simple implementations as test inputs, since they can avoid time-consuming test pattern generation for each fault assumed in each circuit under test (CUT). Randomness properties of test inputs are also useful when there are unknown, variable, or varying factors in the CUT, since in these cases it is impossible to generate efficient test inputs, and such sequences would provide reasonable results in the sense of "average behavior". M sequences are well known to have strong randomness properties, and they are often used as these test inputs. However, additional elaboration is sometimes required. For example, when parallel independent inputs are required to test a CUT with a large number of input terminals k, the total length 2^k − 1 of an M sequence is too long. Therefore, only some partial sequences from entire M sequences can be applied to the circuit. In these cases, the randomness properties assured for entire sequences no longer hold, yet the resulting sequences are still required to have sufficient randomness properties. Randomness properties of three kinds of sequences, all derived from the same original one-dimensional M sequence as parallel test inputs to LSIs, are examined and compared in this paper: sequences from partial two-dimensional M sequences (γ-β planes), vertically-s-shifted sequences, and horizontally-cyclic 1-shifted sequences. The results show that sequences from partial γ-β planes are satisfactory as parallel random input sequences for large CUTs. The implementations of γ-β planes are then discussed. It is seen that simple methods of implementation do exist, and partial sequences from γ-β planes are also promising from this point of view.
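The M sequences discussed above are classically produced by a maximal-length linear-feedback shift register (LFSR); a minimal sketch using a 4-stage Fibonacci LFSR with taps [4, 3], corresponding to the primitive polynomial x^4 + x^3 + 1 (a textbook construction, not the paper's γ-β plane derivation):

```python
# Sketch: a k-stage maximal-length LFSR emits an M sequence of period
# 2^k - 1, visiting every nonzero state exactly once per period.

def m_sequence(k, taps, seed=1):
    """Left-shifting Fibonacci LFSR; taps are 1-based stage numbers
    (here [4, 3] for the primitive polynomial x^4 + x^3 + 1)."""
    state = seed
    out = []
    for _ in range(2 ** k - 1):
        out.append(state & 1)                     # emit the low bit
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1          # XOR of the tapped stages
        state = ((state << 1) | fb) & ((1 << k) - 1)
    return out

seq = m_sequence(4, taps=[4, 3])
print(len(seq), sum(seq))   # 15 bits per period, 8 of them ones
```

One full period of an M sequence contains 2^(k-1) ones and 2^(k-1) − 1 zeros, which is one of the balance properties that make these sequences attractive as pseudo-random test inputs.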
- Research Article
62
- 10.1007/s10515-009-0056-x
- Aug 19, 2009
- Automated Software Engineering
Testing-based fault-localization (TBFL) approaches often require the availability of high-statement-coverage test suites that sufficiently exercise the areas around the faults. However, in practice, fault localization often starts with a test suite whose quality may not be sufficient to apply TBFL approaches. Recent capture/replay or traditional test-generation tools can be used to acquire a high-statement-coverage test collection (i.e., test inputs only) without expected outputs. But it is expensive or even infeasible for developers to manually inspect the results of so many test inputs. To enable practical application of TBFL approaches, we propose three strategies to reduce the test inputs in an existing test collection for result inspection. These three strategies are based on the execution traces of test runs using the test inputs. With the three strategies, developers can select only a representative subset of the test inputs for result inspection and fault localization. We implemented and applied the three test-input-reduction strategies to a series of benchmarks: the Siemens programs, DC, and TCC. The experimental results show that our approach can help developers inspect the results of a smaller subset (less than 10%) of test inputs, whose fault-localization effectiveness is close to that of the whole test collection.
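The core of a trace-based reduction strategy can be sketched as follows: inputs whose runs produce identical execution traces are treated as redundant, and one representative per distinct trace is kept for result inspection (the trace representation here is an illustrative simplification):

```python
# Sketch: reduce a test collection by execution trace. Traces are modeled
# as tuples of covered statement ids; real strategies may use coverage
# sets, sequences, or other trace abstractions.

def reduce_by_trace(traces):
    """traces: {test_input_id: trace_tuple} -> sorted representative ids."""
    seen = {}
    for tid, trace in traces.items():
        seen.setdefault(trace, tid)   # first input per distinct trace wins
    return sorted(seen.values())

traces = {
    "t1": (1, 2, 5),
    "t2": (1, 2, 5),      # duplicates t1's trace -> redundant
    "t3": (1, 3, 4),
    "t4": (1, 3, 4, 6),
}
print(reduce_by_trace(traces))   # ['t1', 't3', 't4']
```

Developers would then inspect only the representatives' results and feed them to a TBFL technique, rather than labeling the whole collection.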
- Research Article
1
- 10.1145/3728972
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters, thus requiring effective testing. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively, leading to existing methods suffering from a lack of suitable test oracles. Some researchers utilize metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on input data and parameter settings of a single framework interface to generate equivalent test inputs, ensuring consistent execution results between original and generated test inputs. Despite their promising effectiveness, they still face certain limitations. (1) Existing MRs overlook structural complexity, limiting test input diversity. (2) Existing MRs focus on limited interfaces, which limits generalization and necessitates additional adaptations. (3) Their detected bugs are related to the result consistency of single interfaces and far from those exposed in multi-interface combinations and runtime metrics (e.g., resource usage). To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on the structure characteristics of DL models. ModelMeta augments seed models with diverse interface combinations to generate test inputs with consistent outputs, guided by the QR-DQN strategy. It then detects bugs through fine-grained analysis of training loss/gradients, memory/GPU usage, and execution time. We evaluate the effectiveness of ModelMeta on three popular DL frameworks (i.e., MindSpore, PyTorch, and ONNX) with 17 DL models from ten real-world tasks ranging from image classification to object detection. 
Results demonstrate that ModelMeta outperforms state-of-the-art baselines in terms of test coverage and the diversity of generated test inputs. Regarding bug detection, ModelMeta has identified 31 new bugs, of which 27 have been confirmed and 11 have been fixed. Among them are seven bugs that existing methods cannot detect, i.e., five wrong resource-usage bugs and two low-efficiency bugs. These results demonstrate the practicality of our method.
- Research Article
1
- 10.1002/stvr.1894
- Aug 21, 2024
- Software Testing, Verification and Reliability
Despite numerous applications of deep learning technologies to critical tasks in various domains, advanced deep neural networks (DNNs) face persistent safety and security challenges, such as overconfidence in predicting out-of-distribution samples and susceptibility to adversarial examples. Thorough testing by exploring the input space serves as a key strategy to ensure the robustness and trustworthiness of these networks. However, existing testing methods focus on disclosing more erroneous model behaviours, overlooking the validity of the generated test inputs. To mitigate this issue, we investigate devising a valid test input generation method for DNNs from a predictive uncertainty perspective. Through a large-scale empirical study across 11 predictive uncertainty metrics for DNNs, we explore the correlation between the validity and uncertainty of test inputs. Our findings reveal that predictive entropy-based and ensemble-based uncertainty metrics effectively characterize the validity of test inputs. Building on these insights, we introduce UCTest, an uncertainty-guided deep learning testing approach, to efficiently generate valid and authentic test inputs. We formulate a joint optimization objective: to uncover the model's misbehaviours by maximizing the loss function while concurrently generating valid test inputs by minimizing uncertainty. Extensive experiments demonstrate that our approach outperforms current testing methods in generating valid test inputs. Furthermore, incorporating natural variation through data augmentation techniques into UCTest effectively boosts the diversity of the generated test inputs.
- Research Article
11
- 10.1016/j.jss.2011.07.028
- Aug 4, 2011
- The Journal of Systems & Software
SimFuzz: Test case similarity directed deep fuzzing
- Book Chapter
257
- 10.1007/11531142_22
- Jan 1, 2005
This paper describes a technique that selects, from a large set of test inputs, a small subset likely to reveal faults in the software under test. The technique takes a program or software component, plus a set of correct executions — say, from observations of the software running properly, or from an existing test suite that a user wishes to enhance. The technique first infers an operational model of the software’s operation. Then, inputs whose operational pattern of execution differs from the model in specific ways are suggestive of faults. These inputs are further reduced by selecting only one input per operational pattern. The result is a small portion of the original inputs, deemed by the technique as most likely to reveal faults. Thus, the technique can also be seen as an error-detection technique. The paper describes two additional techniques that complement test input selection. One is a technique for automatically producing an oracle (a set of assertions) for a test input from the operational model, thus transforming the test input into a test case. The other is a classification-guided test input generation technique that also makes use of operational models and patterns. When generating inputs, it filters out code sequences that are unlikely to contribute to legal inputs, improving the efficiency of its search for fault-revealing inputs. We have implemented these techniques in the Eclat tool, which generates unit tests for Java classes. Eclat’s input is a set of classes to test and an example program execution — say, a passing test suite. Eclat’s output is a set of JUnit test cases, each containing a potentially fault-revealing input and a set of assertions at least one of which fails. In our experiments, Eclat successfully generated inputs that exposed fault-revealing behavior; we have used Eclat to reveal real errors in programs. The inputs it selects as fault-revealing are an order of magnitude more likely to reveal a fault than the generated inputs overall.
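The operational-model idea can be sketched with simple range invariants inferred from correct executions (an illustrative simplification; Eclat's actual operational models and violation patterns are richer):

```python
# Sketch: infer min/max range invariants per observed variable from
# correct executions, then flag test inputs whose execution values fall
# outside the learned model as suggestive of faults.

def infer_model(correct_runs):
    """correct_runs: list of {variable: value} observations."""
    model = {}
    for run in correct_runs:
        for var, val in run.items():
            lo, hi = model.get(var, (val, val))
            model[var] = (min(lo, val), max(hi, val))
    return model

def violates(model, run):
    """Return the variables in this run that fall outside the model."""
    return [v for v, val in run.items()
            if v in model and not (model[v][0] <= val <= model[v][1])]

model = infer_model([{"x": 1, "y": 10}, {"x": 3, "y": 12}])
print(violates(model, {"x": 2, "y": 11}))   # [] -- fits the model
print(violates(model, {"x": 7, "y": 11}))   # ['x'] -- suggestive of a fault
```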