Abstract

Pedestrian detection through computer vision is a building block for a multitude of applications. Recently, there has been increasing interest in convolutional neural network-based architectures for this task. A critical goal of these supervised networks is to generalize the knowledge learned during training to new scenarios with different characteristics, and a suitably labeled dataset is essential to achieve it. The main problem is that manually annotating a dataset usually requires substantial human effort and is costly. To this end, we introduce ViPeD (Virtual Pedestrian Dataset), a new synthetically generated set of images collected with the highly photo-realistic graphical engine of the video game GTA V (Grand Theft Auto V), where annotations are acquired automatically. However, when trained solely on the synthetic dataset, the model suffers a Synthetic2Real domain shift, leading to a performance drop when applied to real-world images. To mitigate this gap, we propose two domain adaptation techniques suitable for the pedestrian detection task, but possibly applicable to general object detection as well. Experiments show that the network trained with ViPeD, exploiting the variety of our synthetic dataset, generalizes to unseen real-world scenarios better than a detector trained on real-world data. Furthermore, we demonstrate that our domain adaptation techniques reduce the Synthetic2Real domain shift, bringing the two domains closer and improving performance when testing the network on real-world images.

Highlights

  • A key task in many intelligent video surveillance systems is pedestrian detection, as it provides essential information for the semantic understanding of video

  • We introduce and make publicly available ViPeD, a new vast synthetic dataset suitable for the pedestrian detection task, with images generated using the photo-realistic video game GTA V

  • We address the pedestrian detection task by proposing a Convolutional Neural Network (CNN)-based solution trained on synthetically generated data


Summary

Introduction

A key task in many intelligent video surveillance systems is pedestrian detection, as it provides essential information for the semantic understanding of video. Since manually annotating new collections of images is expensive and requires great human effort, a promising recent approach is to gather data from virtual-world environments that mimic, as much as possible, the characteristics of real-world scenarios, and where annotations can be acquired with a partially automated process. To this end, we introduce and make publicly available ViPeD (Virtual Pedestrian Dataset), a new vast synthetic dataset suitable for the pedestrian detection task, generated with the highly photo-realistic graphical engine of the video game GTA V (Grand Theft Auto V) by Rockstar; it extends the JTA (Joint Track Auto) dataset presented in [9]. The code, the models, and the dataset are freely available at https://ciampluca.github.io/viped/.
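One of the two domain adaptation techniques listed in the outline, balanced gradient contribution, is commonly formulated as mixing source (synthetic) and target (real) samples at a fixed ratio inside every training batch, so that gradients from both domains contribute to each update. Below is a minimal, framework-agnostic sketch of that batch-composition idea; the function name, the 75/25 split, and the dataset placeholders are illustrative assumptions, not the paper's exact recipe.

```python
import random

def mixed_batches(synthetic, real, batch_size=8, real_fraction=0.25, seed=0):
    """Yield training batches mixing synthetic and real samples.

    Each batch holds round(batch_size * real_fraction) real samples
    (at least one), with the remainder drawn from the synthetic set,
    so every gradient step sees both domains. Illustrative sketch only.
    """
    rng = random.Random(seed)
    n_real = max(1, round(batch_size * real_fraction))
    n_syn = batch_size - n_real

    syn = list(synthetic)
    rl = list(real)
    rng.shuffle(syn)

    # Walk through the (larger) synthetic set in chunks; for each chunk,
    # draw a fresh random handful of real samples to complete the batch.
    for start in range(0, len(syn) - n_syn + 1, n_syn):
        batch = syn[start:start + n_syn] + rng.sample(rl, n_real)
        rng.shuffle(batch)  # avoid a fixed domain ordering inside the batch
        yield batch
```

In an actual detection pipeline, the elements of `synthetic` and `real` would be (image, annotations) pairs fed to the detector's loss; here they can be any objects, which keeps the sampling logic independent of the training framework.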

Pedestrian Detection
Synthetic2Real Domain Adaptation
Training with Synthetic Datasets
Domain Adaptation for Synthetic2Real Pedestrian Detection
Faster R-CNN Object Detector
Domain Adaptation Using Real-World Fine-Tuning
Domain Adaptation using Balanced Gradient Contribution
Experimental Evaluation
Real-World Datasets
Experiments
Testing Generalization Capabilities
Testing Domain Adaptation Techniques over Specific Real-World Scenarios
Method
Conclusions

