Abstract

This paper analyzes in detail how different loss functions influence the generalization abilities of a deep learning-based next frame prediction model for traffic scenes. Our prediction model is a convolutional long short-term memory (ConvLSTM) network that generates the pixel values of the next frame after having observed the raw pixel values of a sequence of four past frames. We trained the model with 21 combinations of seven loss terms using the Cityscapes Sequences dataset and an identical hyper-parameter setting. The loss terms range from pixel-error based terms to adversarial terms. To assess the generalization abilities of the resulting models, we generated predictions up to 20 time-steps into the future for four datasets of increasing visual distance to the training dataset—KITTI Tracking, BDD100K, UA-DETRAC, and KIT AIS Vehicles. All predicted frames were evaluated quantitatively with both traditional pixel-based evaluation metrics, that is, mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), and recent, more advanced, feature-based evaluation metrics, that is, Fréchet inception distance (FID) and learned perceptual image patch similarity (LPIPS). The results show that solely by choosing a different combination of losses, we can boost the prediction performance on new datasets by up to 55%, and by up to 50% for long-term predictions.
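As a hedged illustration of the traditional pixel-based metrics named above, the following sketch computes MSE and PSNR for two toy grayscale frames with numpy (the frame contents are made up for the example; SSIM is omitted because it requires windowed statistics):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two images (float arrays in [0, 1])."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    err = mse(pred, target)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / err)

# Toy 4x4 "frames": a gradient target and a uniformly brightened prediction.
target = np.linspace(0.0, 1.0, 16).reshape(4, 4)
pred = np.clip(target + 0.1, 0.0, 1.0)
print(mse(pred, target))
print(psnr(pred, target))
```

Lower MSE and higher PSNR both indicate a closer pixel-wise match; as the highlights below note, neither captures perceptual similarity.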

Highlights

  • The ability to predict possible future actions of traffic participants is essential for anticipatory driving

  • In contrast to the traditional metrics, which directly compare the pixel values of two images, the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS) measure the distance between two images not in pixel space but in feature space

  • We have shown that an intelligently designed loss function is essential for a prediction model to generate plausible frames of traffic scenes
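The distinction drawn in the highlights above can be made concrete with a small, illustrative numpy experiment (a toy construction, not from the paper): a one-pixel shift of a smooth gradient frame is almost invisible to a human observer, yet a pixel-wise metric penalizes it an order of magnitude more than a barely visible brightness change — one motivation for feature-space metrics such as FID and LPIPS.

```python
import numpy as np

# A smooth horizontal gradient serves as a toy "frame".
frame = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))

# Shift the content one pixel to the right (border column replicated).
shifted = np.empty_like(frame)
shifted[:, 1:] = frame[:, :-1]
shifted[:, 0] = frame[:, 0]

# Pixel-wise MSE of the shifted frame vs. a barely visible +0.01 brightness
# perturbation: the near-invisible shift is penalized far more strongly.
mse_shift = float(np.mean((frame - shifted) ** 2))
mse_noise = float(np.mean((frame - np.clip(frame + 0.01, 0.0, 1.0)) ** 2))
print(mse_shift, mse_noise)
```

A feature-based metric computed on activations of a pretrained network would rank the shifted frame as far closer to the original than this pixel-space comparison suggests.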


Introduction

The ability to predict possible future actions of traffic participants is essential for anticipatory driving. Predictions of probable future events can prove beneficial when used as additional inputs to the system: they help to plan actions more efficiently and to make better-informed decisions. The learned features of an ideal network for video prediction meet both of the following criteria:

  • They are generic enough to enable the model to generalize well over a variety of different scene contents

  • They produce high-quality predictions that preserve details of the observed input scene across multiple prediction steps
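The 21 loss combinations studied in the paper are weighted sums of individual loss terms. As an illustrative sketch only (the term names, weights, and frame contents below are hypothetical, not the paper's actual loss set), the following combines an L1 pixel loss with a gradient-difference term, a pairing commonly used in video prediction to counteract the blur that pure pixel losses encourage:

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute pixel error."""
    return float(np.mean(np.abs(pred - target)))

def gdl_loss(pred, target):
    """Gradient difference loss: compares horizontal and vertical image
    gradients, penalizing blurry predictions that plain pixel losses tolerate."""
    dy_p, dx_p = np.diff(pred, axis=0), np.diff(pred, axis=1)
    dy_t, dx_t = np.diff(target, axis=0), np.diff(target, axis=1)
    return float(np.mean(np.abs(np.abs(dy_p) - np.abs(dy_t)))
                 + np.mean(np.abs(np.abs(dx_p) - np.abs(dx_t))))

def combined_loss(pred, target, w_l1=1.0, w_gdl=1.0):
    # Weighted sum of loss terms; which terms are active and how they are
    # weighted defines one "combination" in the sense of the paper.
    return w_l1 * l1_loss(pred, target) + w_gdl * gdl_loss(pred, target)

target = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))  # sharp gradient frame
blurry = np.full_like(target, target.mean())          # flat, "blurred" guess
print(combined_loss(blurry, target))
```

Setting one weight to zero disables that term, so sweeping the weights reproduces the kind of ablation over loss combinations that the paper performs at scale.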

