Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems
Deep Neural Networks (DNNs) for Autonomous Driving Systems (ADS) are typically trained on real-world images and tested using synthetic images from simulators. This approach results in training and test datasets with dissimilar distributions, which can potentially lead to erroneously decreased test accuracy. To address this issue, the literature suggests applying domain-to-domain translators to test datasets to bring them closer to the training datasets. However, translating images used for testing may unpredictably affect the reliability, effectiveness and efficiency of the testing process. Hence, this paper investigates the following questions in the context of ADS: Could translators reduce the effectiveness of images used for ADS-DNN testing and their ability to reveal faults in ADS-DNNs? Can translators result in excessive time overhead during simulation-based testing? To address these questions, we consider three domain-to-domain translators: CycleGAN and neural style transfer, from the literature, and SAEVAE, our proposed translator. Our results for two critical ADS tasks, lane keeping and object detection, indicate that translators significantly narrow the gap in ADS test accuracy caused by distribution dissimilarities between training and test data, with SAEVAE outperforming the other two translators. We show that, based on the recent diversity, coverage, and fault-revealing ability metrics for testing deep-learning systems, translators do not compromise the diversity and the coverage of test data, nor do they lead to revealing fewer faults in ADS-DNNs. Further, among the translators considered, SAEVAE incurs a negligible overhead in simulation time and can be efficiently integrated into simulation-based testing. Finally, we show that translators increase the correlation between offline and simulation-based testing results, which can help reduce the cost of simulation-based testing. Our replication package is available online [1].
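To illustrate where a translator sits in the testing loop, the minimal sketch below (not the paper's code; `translator`, `ads_model`, and `sim_frame` are placeholder names) passes each simulator frame through a domain-to-domain translator before the ADS-DNN produces, for example, a steering prediction for lane keeping.

```python
# Minimal sketch, assuming a trained translator (e.g. a SAEVAE- or CycleGAN-style
# generator) and an ADS-DNN with a scalar output; all names are illustrative.
import torch

@torch.no_grad()
def predict_steering(ads_model, translator, sim_frame):
    """sim_frame: (3, H, W) synthetic simulator image, values in [0, 1]."""
    x = sim_frame.unsqueeze(0)   # add batch dimension
    x = translator(x)            # map the synthetic frame toward the real-image domain
    return ads_model(x).item()   # e.g. predicted steering angle for lane keeping
```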
- Conference Article
200
- 10.1109/sp40001.2021.00076
- May 1, 2021
In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF algorithms. Our attack is also found to be stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.
- Research Article
7
- 10.1109/tip.2024.3413598
- Jan 1, 2024
- IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Due to the advancement of deep learning, the performance of salient object detection (SOD) has been significantly improved. However, deep learning-based techniques require a sizable amount of pixel-wise annotations. To relieve the burden of data annotation, a variety of deep weakly-supervised and unsupervised SOD methods have been proposed, yet the performance gap between them and fully supervised methods remains significant. In this paper, we propose a novel, cost-efficient salient object detection framework, which can adapt models from synthetic data to real-world data with the help of a limited number of actively selected annotations. Specifically, we first construct a synthetic SOD dataset by copying and pasting foreground objects into pure background images. With the masks of foreground objects taken as the ground-truth saliency maps, this dataset can be used for training the SOD model initially. However, due to the large domain gap between synthetic images and real-world images, the performance of the initially trained model on the real-world images is deficient. To transfer the model from the synthetic dataset to the real-world datasets, we further design an uncertainty-aware active domain adaptive algorithm to generate labels for the real-world target images. The prediction variances against data augmentations are utilized to calculate the superpixel-level uncertainty values. For those superpixels with relatively low uncertainty, we directly generate pseudo labels according to the network predictions. Meanwhile, we select a few superpixels with high uncertainty scores and assign labels to them manually. This labeling strategy is capable of generating high-quality labels without incurring too much annotation cost. Experimental results on six benchmark SOD datasets demonstrate that our method outperforms the existing state-of-the-art weakly-supervised and unsupervised SOD methods and is even comparable to the fully supervised ones. Code will be released at: https://github.com/czh-3/UADA.
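To make the labeling strategy more concrete, here is an illustrative sketch of the superpixel-level uncertainty rule described above (the threshold and the binarization rule are assumptions, not the authors' exact implementation).

```python
# Hedged sketch: average the prediction variance over augmentations per superpixel;
# confident superpixels receive pseudo-labels, uncertain ones are sent to a human.
import numpy as np

def label_superpixels(saliency_preds, superpixels, threshold=0.05):
    """saliency_preds: (A, H, W) saliency maps under A augmentations, in [0, 1].
    superpixels: (H, W) integer superpixel ids."""
    variance = saliency_preds.var(axis=0)    # per-pixel prediction variance
    mean_pred = saliency_preds.mean(axis=0)
    pseudo_labels, to_annotate = {}, []
    for sp_id in np.unique(superpixels):
        mask = superpixels == sp_id
        if variance[mask].mean() < threshold:               # low uncertainty: pseudo-label
            pseudo_labels[sp_id] = float(mean_pred[mask].mean() > 0.5)
        else:                                               # high uncertainty: annotate manually
            to_annotate.append(sp_id)
    return pseudo_labels, to_annotate
```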
- Research Article
12
- 10.3390/rs13122308
- Jun 12, 2021
- Remote Sensing
Unmanned aerial vehicle (UAV) imaging is a promising data acquisition technique for image-based plant phenotyping. However, UAV images have a lower spatial resolution than similarly equipped in-field ground-based vehicle systems, such as carts, because of their distance from the crop canopy, which can be particularly problematic for measuring small-sized plant features. In this study, the performance of three deep learning-based super-resolution models, employed as a pre-processing tool to enhance the spatial resolution of low-resolution images of three different kinds of crops, was evaluated. To train a super-resolution model, aerial images were collected with two separate sensors co-mounted on a UAV flown over lentil, wheat and canola breeding trials. A software workflow was created to pre-process and align real-world low-resolution and high-resolution images and use them as inputs and targets for training super-resolution models. To demonstrate the effectiveness of real-world images, three different experiments were conducted, employing synthetic images, manually downsampled high-resolution images, or real-world low-resolution images as input to the models. The performance of the super-resolution models demonstrates that models trained with synthetic images cannot generalize to real-world images and fail to reproduce images comparable with the targets. However, the same models trained with real-world datasets can reconstruct higher-fidelity outputs, which are better suited for measuring plant phenotypes.
- Conference Article
8
- 10.1109/icip46576.2022.9897599
- Oct 16, 2022
Excellent text recognition results have been obtained by training recognition models with synthetic images. However, recognizing text from real-world images still faces challenges due to the domain shift between synthetic and real-world text images. One strategy to eliminate this domain difference without manual annotation is unsupervised domain adaptation (UDA). Due to the characteristics of sequential labeling tasks, most popular UDA methods cannot be directly applied to text recognition. To tackle this problem, we propose a UDA method that minimizes latent entropy on sequence-to-sequence attention-based models with class-balanced self-paced learning. Experimental results show that our proposed framework achieves better recognition results than existing methods on most UDA text recognition benchmarks. All code is publicly available.
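The entropy-minimization term at the core of such methods can be pictured with a short sketch like the one below (a generic per-decoding-step entropy loss on unlabeled target images; the class-balanced self-paced selection described above is omitted, and the shapes are assumptions).

```python
# Hedged sketch of entropy minimization for a sequence-to-sequence recognizer.
import torch
import torch.nn.functional as F

def sequence_entropy_loss(logits):
    """logits: (B, T, C) class scores per decoding step for target-domain images."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (B, T) per-step entropy
    return entropy.mean()  # minimizing this sharpens predictions on the target domain
```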
- Research Article
8
- 10.3390/machines12020099
- Feb 1, 2024
- Machines
We present a diverse dataset of industrial metal objects with unique characteristics such as symmetry, texturelessness, and high reflectiveness. These features introduce challenging conditions that are not captured in existing datasets. Our dataset comprises both real-world and synthetic multi-view RGB images with 6D object pose labels. Real-world data were obtained by recording multi-view images of scenes with varying object shapes, materials, carriers, compositions, and lighting conditions. This resulted in over 30,000 real-world images. We introduce a new public tool that enables the quick annotation of 6D object pose labels in multi-view images. This tool was used to provide 6D object pose labels for all real-world images. Synthetic data were generated by carefully simulating real-world conditions and varying them in a controlled and realistic way. This resulted in over 500,000 synthetic images. The close correspondence between synthetic and real-world data and controlled variations will facilitate sim-to-real research. Our focus on industrial conditions and objects will facilitate research on computer vision tasks, such as 6D object pose estimation, which are relevant for many industrial applications, such as machine tending. The dataset and accompanying resources are available on the project website.
- Research Article
2
- 10.1007/s11263-023-01973-w
- Jan 6, 2024
- International Journal of Computer Vision
In this paper, we study the problem of unsupervised object segmentation from single images. We do not introduce a new algorithm, but systematically investigate the effectiveness of existing unsupervised models on challenging real-world images. We first introduce seven complexity factors to quantitatively measure the distributions of background and foreground object biases in appearance and geometry for datasets with human annotations. With the aid of these factors, we empirically find that, not surprisingly, existing unsupervised models fail to segment generic objects in real-world images, although they can easily achieve excellent performance on numerous simple synthetic datasets, due to the vast gap in objectness biases between synthetic and real images. By conducting extensive experiments on multiple groups of ablated real-world datasets, we ultimately find that the key factors underlying the failure of existing unsupervised models on real-world images are the challenging distributions of background and foreground object biases in appearance and geometry. Because of this, the inductive biases introduced in existing unsupervised models can hardly capture the diverse object distributions. Our research results suggest that future work should exploit more explicit objectness biases in the network design.
- Single Report
- 10.2172/1670874
- Sep 1, 2020
Deep convolutional neural networks (DCNNs) currently provide state-of-the-art performance on image classification and object detection tasks, and there are many global security mission areas where such models could be extremely useful. Crucially, the success of these models is driven in large part by the widespread availability of high-quality open-source datasets such as ImageNet, Common Objects in Context (COCO), and KITTI, which contain millions of images with thousands of unique labels. However, global security-relevant objects of interest can be difficult to obtain: relevant events are low frequency and high consequence; the content of relevant images is sensitive; and adversaries and proliferators seek to obscure their activities. For these cases where exemplar data is hard to come by, even fine-tuning an existing model with available data can be effectively impossible. Recent work demonstrated that models can be trained using a combination of real-world and synthetic images generated from 3D representations; that such models can exceed the performance of models trained using real-world data alone; and that the generated images need not be perfectly realistic (Tremblay et al., 2018). However, this approach still required hundreds to thousands of real-world images for training and fine-tuning, which for sparse, global security-relevant datasets can be an unrealistic hurdle. In this research, we validate the performance and behavior of DCNN models as we drive the number of real-world images used for training object detection tasks down to a minimal set. We perform multiple experiments to identify the best approach to train DCNNs from an extremely small set of real-world images. In doing so, we: develop state-of-the-art, parameterized 3D models based on real-world images and sample from their parameters to increase the variance in synthetic image training data; use machine learning explainability techniques to highlight and correct, through targeted training, the biases that result from training using completely synthetic images; and validate our results by comparing the performance of the models trained on synthetic data to one another, and to a control model created by fine-tuning an existing ImageNet-trained model with a limited number (hundreds) of real-world images.
- Research Article
70
- 10.1007/s11263-020-01416-w
- Jan 30, 2021
- International Journal of Computer Vision
The capability of image deraining is a highly desirable component of intelligent decision-making in autonomous driving and outdoor surveillance systems. Image deraining aims to restore the clean scene from a degraded image captured on a rainy day. Although numerous single-image deraining algorithms have been recently proposed, these algorithms are mainly evaluated using certain types of synthetic images, assuming a specific rain model, plus a few real images. It remains unclear how these algorithms would perform on rainy images acquired "in the wild" and how we could gauge the progress in the field. This paper aims to bridge this gap. We present a comprehensive study and evaluation of existing single-image deraining algorithms, using a new large-scale benchmark consisting of both synthetic and real-world rainy images of various rain types. This dataset covers diverse rain models (rain streak, rain drop, rain and mist). We further provide a comprehensive suite of criteria for deraining algorithm evaluation, including full- and no-reference metrics, subjective evaluation, and a novel task-driven evaluation. The proposed benchmark is accompanied by extensive experimental results that facilitate the assessment of the state of the art on a quantitative basis. Our evaluation and analysis indicate the gap between the achievable performance on synthetic rainy images and the practical demand on real-world images. We show that, despite many advances, image deraining is still a largely open problem. We conclude by summarizing our general observations, identifying open research challenges and pointing out future directions. Our code and dataset are publicly available at http://uee.me/ddQsw .
- Research Article
2
- 10.3390/electronics13020250
- Jan 5, 2024
- Electronics
Most real-world super-resolution methods require synthetic image pairs for training. However, the frequency domain gap between synthetic images and real-world images leads to artifacts and blurred reconstructions. This work points out that the main reason for the frequency domain gap is that aliasing exists in real-world images, but the degradation model used to generate synthetic images ignores the impact of aliasing on images. Therefore, a method is proposed in this work to assess aliasing in images undergoing unknown degradation by measuring the distance to their alias-free counterparts. Leveraging this assessment, a domain-translation framework is introduced to learn degradation from high-resolution to low-resolution images. The proposed framework employs a frequency-domain branch and loss function to generate synthetic images with aliasing features. Experiments validate that the proposed domain-translation framework enhances the visual quality and quantitative results compared to existing super-resolution models across diverse real-world image benchmarks. In summary, this work offers a practical solution to the real-world super-resolution problem by minimizing the frequency domain gap between synthetic and real-world images.
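As a toy illustration of the idea of comparing an image against an alias-free counterpart, the sketch below contrasts the spectrum of a naively downsampled image with that of a low-pass-filtered (anti-aliased) version before downsampling. This is an assumed formulation for intuition only, not the paper's metric or framework.

```python
# Toy aliasing assessment: larger spectral distance -> more aliasing energy.
import numpy as np
from scipy.ndimage import gaussian_filter

def aliasing_score(hr_image, scale=2):
    """hr_image: (H, W) grayscale array in [0, 1]."""
    naive = hr_image[::scale, ::scale]                                   # keeps aliasing
    alias_free = gaussian_filter(hr_image, sigma=scale / 2)[::scale, ::scale]
    diff = np.abs(np.fft.fft2(naive)) - np.abs(np.fft.fft2(alias_free))  # spectral gap
    return float(np.mean(np.abs(diff)))
```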
- Research Article
10
- 10.3390/s23020621
- Jan 5, 2023
- Sensors (Basel, Switzerland)
Semantic image segmentation is a core task for autonomous driving and is performed by deep models. Since training these models requires a costly, human-based image-labeling effort, the use of synthetic images with automatically generated labels together with unlabeled real-world images is a promising alternative. This implies addressing an unsupervised domain adaptation (UDA) problem. In this paper, we propose a new co-training procedure for synth-to-real UDA of semantic segmentation models. It performs iterations where the (unlabeled) real-world training images are labeled by intermediate deep models trained with both the (labeled) synthetic images and the real-world ones labeled in previous iterations. More specifically, a self-training stage provides two domain-adapted models, and a model collaboration loop allows the mutual improvement of these two models. The final semantic segmentation labels (pseudo-labels) for the real-world images are provided by these two models. The overall procedure treats the deep models as black boxes and drives their collaboration at the level of pseudo-labeled target images, i.e., it requires neither modifying loss functions nor explicit feature alignment. We test our proposal on standard synthetic and real-world datasets for onboard semantic segmentation. Our procedure shows improvements ranging from approximately 13 to 31 mIoU points over the baselines.
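One simple way to picture how two black-box models can collaborate at the pseudo-label level is a pixel-wise agreement rule like the sketch below. This is an assumed fusion rule for illustration; the paper's actual collaboration loop exchanges pseudo-labeled target images between the two models across iterations.

```python
# Hedged sketch: keep pixels where the two domain-adapted models agree,
# mark disagreements with the usual "ignore" label so they do not train the next round.
import numpy as np

IGNORE_LABEL = 255  # common "void" class id in semantic segmentation

def fuse_pseudo_labels(pred_a, pred_b):
    """pred_a, pred_b: (H, W) integer class-id maps from the two models."""
    return np.where(pred_a == pred_b, pred_a, IGNORE_LABEL)
```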
- Conference Article
1
- 10.1109/icpr.2018.8545367
- Aug 1, 2018
Intrinsic Image Decomposition (IID) refers to recovering the albedo and shading from images, and it plays important roles in addressing computer vision tasks such as illumination-invariant object recognition and image recoloring. IID is an ill-posed problem and lacks actual labelled samples for learning. This paper presents a deep neural network (DNN) based method to address this problem. To facilitate the training of the DNN, we synthesize an intrinsic image dataset by rendering 3D models. To make the learnt model generalize well to real-world images with better visual results, we employ a perceptual loss in model learning. The perceptual loss is constructed upon the activations of a neural network pre-trained on real images, such as the VGG network. Such a loss function implicitly introduces knowledge from real-world images and has the ability of multi-level semantic understanding of the decomposed results. Experimental results show that our model trained on a synthetic single-object dataset can produce good decomposition results not only on synthetic images but also on real-world scene-level images (containing multiple objects).
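For readers unfamiliar with perceptual losses, a minimal PyTorch-style sketch of the idea follows. The VGG layer cut-off and the single-layer comparison are assumptions made for brevity; the paper may compare activations at several levels.

```python
# Hedged sketch: compare activations of a network pre-trained on real images.
import torch.nn.functional as F
from torchvision import models

# Frozen feature extractor pre-trained on ImageNet (layer choice is illustrative).
vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def perceptual_loss(prediction, target):
    """prediction, target: (B, 3, H, W) images, e.g. decomposed and reference albedo."""
    return F.mse_loss(vgg_features(prediction), vgg_features(target))
```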
- Conference Article
22
- 10.14722/ndss.2022.24177
- Jan 1, 2022
In high-level Autonomous Driving (AD) systems, behavioral planning is in charge of making high-level driving decisions, such as cruising and stopping, and is thus highly security-critical. In this work, we perform the first systematic study of semantic security vulnerabilities specific to overly-conservative AD behavioral planning behaviors, i.e., those that can cause failed or significantly-degraded mission performance, which can be critical for AD services such as robo-taxi/delivery. We call them semantic Denial-of-Service (DoS) vulnerabilities, which we envision to be most generally exposed in practical AD systems due to the tendency for conservativeness to avoid safety incidents. To achieve high practicality and realism, we assume that the attacker can only introduce seemingly-benign external physical objects to the driving environment, e.g., off-road dumped cardboard boxes. To systematically discover such vulnerabilities, we design PlanFuzz, a novel dynamic testing approach that addresses various problem-specific design challenges. Specifically, we propose and identify planning invariants as novel testing oracles, and design new input generation to systematically enforce problem-specific constraints for attacker-introduced physical objects. We also design a novel behavioral planning vulnerability distance metric to effectively guide the discovery. We evaluate PlanFuzz on 3 planning implementations from practical open-source AD systems, and find that it can effectively discover 9 previously unknown semantic DoS vulnerabilities without false positives. We find all our new designs necessary, as without each design, statistically significant performance drops are generally observed. We further perform exploitation case studies using simulation and real-vehicle traces. We discuss root causes and potential fixes.
- Conference Article
414
- 10.1109/cvpr.2018.00019
- Jun 1, 2018
This paper addresses 2 challenging tasks: improving the quality of low-resolution facial images and accurately locating the facial landmarks on such poor-resolution images. To this end, we make the following 5 contributions: (a) we propose Super-FAN: the very first end-to-end system that addresses both tasks simultaneously, i.e. it both improves face resolution and detects the facial landmarks. The novelty of Super-FAN lies in incorporating structural information in a GAN-based super-resolution algorithm via integrating a sub-network for face alignment through heatmap regression and optimizing a novel heatmap loss. (b) We illustrate the benefit of training the two networks jointly by reporting good results not only on frontal images (as in prior work) but on the whole spectrum of facial poses, and not only on synthetic low-resolution images (as in prior work) but also on real-world images. (c) We improve upon the state-of-the-art in face super-resolution by proposing a new residual-based architecture. (d) Quantitatively, we show large improvement over the state-of-the-art for both face super-resolution and alignment. (e) Qualitatively, we show for the first time good results on real-world low-resolution images.
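The heatmap loss can be pictured as comparing the landmark heatmaps produced on the super-resolved face with those produced on the high-resolution ground truth, roughly as in the sketch below. This is an assumed form for illustration, not the authors' exact formulation.

```python
# Hedged sketch: penalize structural (landmark) differences between SR and HR faces.
import torch.nn.functional as F

def heatmap_loss(fan, sr_face, hr_face):
    """fan: a face-alignment network mapping (B, 3, H, W) images to (B, K, H', W') heatmaps."""
    return F.mse_loss(fan(sr_face), fan(hr_face).detach())
```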
- Conference Article
1323
- 10.1109/cvpr.2017.186
- Jul 1, 2017
We propose a new deep network architecture for removing rain streaks from individual images based on the deep convolutional neural network (CNN). Inspired by the deep residual network (ResNet) that simplifies the learning process by changing the mapping form, we propose a deep detail network to directly reduce the mapping range from input to output, which makes the learning process easier. To further improve the de-rained result, we use a priori image domain knowledge by focusing on high frequency detail during training, which removes background interference and focuses the model on the structure of rain in images. This demonstrates that a deep architecture not only has benefits for high-level vision tasks but also can be used to solve low-level imaging problems. Though we train the network on synthetic data, we find that the learned network generalizes well to real-world test images. Experiments show that the proposed method significantly outperforms state-of-the-art methods on both synthetic and real-world images in terms of both qualitative and quantitative measures. We discuss applications of this structure to denoising and JPEG artifact reduction at the end of the paper.
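The "focus on high-frequency detail" step can be sketched as subtracting a low-pass base layer from the rainy input and training the network on the remaining detail layer. In the sketch below, simple average pooling stands in for whatever low-pass filter the original work uses, so treat it as an assumed approximation rather than the paper's preprocessing.

```python
# Hedged sketch: split the rainy image into base (low-pass) and detail (high-pass) layers.
import torch.nn.functional as F

def detail_layer(image, kernel_size=15):
    """image: (B, 3, H, W) rainy input. Returns the high-frequency (detail) part."""
    pad = kernel_size // 2
    base = F.avg_pool2d(F.pad(image, (pad,) * 4, mode="reflect"), kernel_size, stride=1)
    return image - base   # the residual CNN is trained on this detail layer
```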
- Research Article
5
- 10.1109/tcsvt.2023.3269955
- Nov 1, 2023
- IEEE Transactions on Circuits and Systems for Video Technology
Despite remarkable progress in single-image super-resolution based on neural networks, the results are not ideal when applied to real-world images, because the real-world degradation process is unknown and complex. The emergence of some optical zoom datasets shows that neural networks can still achieve good results on real-world images as long as the low-resolution images used for training have features and distributions similar to those of the real-world images. However, obtaining such optical zoom datasets is complicated, and the datasets are only applicable to specific cameras and shooting conditions. By studying the optical zoom datasets, we propose a super-resolution image degradation model consisting of blurring, frequency-domain processing, adding noise and downsampling. Specifically, blurring uses a blur kernel with a wave-like shape inferred from the point spread function, which produces artifacts like those in real-world images. Frequency-domain processing simulates the frequency-domain aliasing of real-world images, such as jagged edges and background stripes. Experiments demonstrate that the new degradation model achieves visual effects comparable to optical zoom datasets. Existing high-resolution datasets can be converted to "optical zoom datasets" by the degradation model, where the synthetic low-resolution images have real-world image features, thereby extending super-resolution methods to real-world images.
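Schematically, such a degradation pipeline can be pictured as below. This is a simplified sketch: naive strided downsampling stands in for the paper's dedicated frequency-domain processing step (it is what introduces aliasing here), and the kernel, noise level, and scale are placeholders rather than the paper's calibrated values.

```python
# Hedged sketch of a blur -> downsample (aliasing) -> noise degradation pipeline.
import numpy as np
from scipy.ndimage import convolve

def degrade(hr, kernel, scale=4, noise_sigma=0.01):
    """hr: (H, W) image in [0, 1]; kernel: 2D blur kernel (e.g. a wave-like PSF estimate)."""
    blurred = convolve(hr, kernel, mode="reflect")                        # blur with the PSF-like kernel
    aliased = blurred[::scale, ::scale]                                   # naive downsampling keeps aliasing
    noisy = aliased + np.random.normal(0.0, noise_sigma, aliased.shape)   # sensor-like noise
    return np.clip(noisy, 0.0, 1.0)
```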