Abstract

This paper proposes a strategy for visual perception in the context of autonomous driving. Humans, when neither distracted nor impaired, remain the best drivers currently available. For this reason, we take inspiration from two theoretical ideas about the human mind and its neural organization. The first idea concerns how the brain uses structures of neuron ensembles that expand and compress information to extract abstract concepts from visual experience and encode them into compact representations. The second idea suggests that these neural perceptual representations are not neutral but functional to predicting the future state of the environment. Likewise, the prediction mechanism is not neutral but oriented toward the planning of future action. We identify within the deep learning framework two artificial counterparts of these neurocognitive theories. We find a correspondence between the first theoretical idea and the architecture of convolutional autoencoders, while we translate the second into a training procedure that learns compact representations oriented to driving tasks from two distinct perspectives. From a static perspective, we force separate groups of neural units in the compact representations to distinctly represent specific concepts crucial to the driving task. From a dynamic perspective, we bias the compact representations to predict how the current road scenario will evolve. We successfully learn compact representations that use as few as 16 neural units for each of the two basic driving concepts we consider: cars and lanes. We keep the two concepts separated in the latent space to facilitate the interpretation and manipulation of the perceptual representations. The source code for this paper is available at https://github.com/3lis/rnn_vae.
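The static-perspective partition described in the abstract can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes a toy linear encoder and two hypothetical decoder heads, each reading only its own 16-unit slice of the latent vector, so that the reconstruction of each concept (cars, lanes) depends on a disjoint group of units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a flattened 64x64 grayscale frame and a 32-unit latent
# code split into two 16-unit concept groups (cars | lanes).
D_IN, D_CAR, D_LANE = 64 * 64, 16, 16
D_LAT = D_CAR + D_LANE

# Random weights stand in for a trained convolutional encoder and decoders.
W_enc = rng.normal(0, 0.01, (D_LAT, D_IN))
W_car = rng.normal(0, 0.01, (D_IN, D_CAR))    # head for car segmentation
W_lane = rng.normal(0, 0.01, (D_IN, D_LANE))  # head for lane segmentation

def encode(x):
    """Compress one frame into a single compact latent vector."""
    return np.tanh(W_enc @ x)

def decode(z):
    """Each decoder head sees only its own slice of the latent space."""
    z_car, z_lane = z[:D_CAR], z[D_CAR:]
    car_mask = W_car @ z_car      # reconstructs only the 'cars' map
    lane_mask = W_lane @ z_lane   # reconstructs only the 'lanes' map
    return car_mask, lane_mask

frame = rng.random(D_IN)
z = encode(frame)
cars, lanes = decode(z)
```

Because each head receives only its own slice, any training signal from the car-segmentation target can only shape the first 16 latent units, and likewise for lanes, which is the mechanism that keeps the two concepts separated in the latent space.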

Highlights

  • Road traffic injuries are the leading cause of death for the age group between 5 and 29 years [1]

  • The representations must bear a semantic explanation, i.e., parts of the latent space are associated with concepts useful in the context of driving – cars and lanes in this case, but the work is open to further extensions such as pedestrians or bikes. The model learns these meaningful representations by exploiting semantic segmentation as a supporting task, as we will show in §IV-B, using a multi-decoder network which forces the partitioning of the internal representations into distinct concepts

  • Several more recent proposals suggest the inclusion of intermediate representations, such as the so-called mid-to-mid strategy used in ChauffeurNet [66], Waymo’s autonomous driving system


Summary

INTRODUCTION

Road traffic injuries are the leading cause of death for the age group between 5 and 29 years [1]. The representations must bear a semantic explanation, i.e., parts of the latent space are associated with concepts useful in the context of driving – cars and lanes in this case, but the work is open to further extensions such as pedestrians or bikes. The model learns these meaningful representations by exploiting semantic segmentation as a supporting task, as we will show in §IV-B, using a multi-decoder network which forces the partitioning of the internal representations into distinct concepts. Learning the entire range of road scenarios from steering supervision alone, considering all possible appearances of objects relevant to the drive, is not achievable in practical settings. For this reason, several more recent proposals suggest the inclusion of intermediate representations, such as the so-called mid-to-mid strategy used in ChauffeurNet [66], Waymo's autonomous driving system. We will show in §IV-B how our partitioning of the latent spaces differs from these approaches.
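The dynamic perspective, biasing the compact representations toward predicting how the scene will evolve, can be sketched as a recurrent step over past latent codes. The Elman-style cell and the dimensions below are illustrative assumptions, not the paper's NET3 temporal autoencoder.

```python
import numpy as np

rng = np.random.default_rng(1)

D_LAT, D_HID = 32, 64  # latent size and a hypothetical recurrent state size

# Random weights stand in for a trained recurrent predictor.
W_xh = rng.normal(0, 0.1, (D_HID, D_LAT))
W_hh = rng.normal(0, 0.1, (D_HID, D_HID))
W_hz = rng.normal(0, 0.1, (D_LAT, D_HID))

def predict_next(latent_seq):
    """Roll a simple RNN over the latent codes of past frames and
    emit the predicted latent code of the next frame."""
    h = np.zeros(D_HID)
    for z in latent_seq:
        h = np.tanh(W_xh @ z + W_hh @ h)
    return W_hz @ h

# Latent codes of the last 4 frames -> predicted code of frame t+1.
past = [rng.normal(size=D_LAT) for _ in range(4)]
z_next = predict_next(past)
```

Training such a predictor on sequences of latent codes pushes the encoder toward representations that carry the information needed to anticipate the road scenario, rather than merely to reconstruct the current frame.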

THE NEURAL MODELS
NET3: TEMPORAL AUTOENCODER
RESULTS