Abstract
Human eye movements while driving reveal that visual attention largely depends on the context in which it occurs. Furthermore, an autonomous vehicle performing this function would be more reliable if its outputs were understandable. Capsule Networks have been presented as a great opportunity to explore new horizons in the Computer Vision field, due to their capability to structure and relate latent information. In this article, we present a hierarchical approach for the prediction of eye fixations in autonomous driving scenarios. Context-driven visual attention can be modeled by considering different conditions which, in turn, are represented as combinations of several spatio-temporal features. With the aim of learning these conditions, we have built an encoder-decoder network that merges visual feature information using a global-local definition of capsules. Two types of capsules are distinguished: representational capsules for features and discriminative capsules for conditions. The latter, together with eye fixations recorded with wearable eye-tracking glasses, allow the model to learn both to predict contextual conditions and to estimate visual attention by means of a multi-task loss function. Experiments show how our approach is able to express either frame-level (global) or pixel-wise (local) relationships between features and contextual conditions, allowing for interpretability while maintaining or improving the performance of related black-box systems in the literature. Indeed, our proposal offers an improvement of 29% in terms of Information Gain with respect to the best performance reported in the literature.
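As a rough illustration of the multi-task setup described in the abstract, the sketch below combines a fixation-map (saliency) term with a frame-level contextual-condition classification term. The module name, the balancing weight and the specific loss terms are illustrative assumptions, not the paper's exact capsule-based formulation.

```python
# Minimal PyTorch sketch of a multi-task objective: eye-fixation prediction
# plus contextual-condition classification. Names and weights are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAttentionLoss(nn.Module):
    def __init__(self, condition_weight: float = 0.5):
        super().__init__()
        self.condition_weight = condition_weight  # assumed balancing factor

    def forward(self, pred_saliency, gt_fixation_map, pred_conditions, gt_conditions):
        eps = 1e-8
        # Saliency term: KL divergence between the ground-truth fixation
        # distribution and the predicted map, both normalized per frame.
        p = gt_fixation_map / (gt_fixation_map.sum(dim=(-2, -1), keepdim=True) + eps)
        q = pred_saliency / (pred_saliency.sum(dim=(-2, -1), keepdim=True) + eps)
        saliency_loss = (p * torch.log((p + eps) / (q + eps))).sum(dim=(-2, -1)).mean()

        # Condition term: frame-level multi-label classification of contextual
        # conditions (e.g. intersection ahead, pedestrian present) -- assumed targets.
        condition_loss = F.binary_cross_entropy_with_logits(pred_conditions, gt_conditions)

        return saliency_loss + self.condition_weight * condition_loss
```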
Highlights
The way contemporary Computer Vision systems represent our world seems progressively further from being understood by humans.
In an effort to contribute to visual attention understanding in real settings, we propose a top-down (TD) system to carry out an autonomous driving task, which is able to offer an interpretation of its predictions by means of Capsule Networks [17], [18].
Their performance is significantly lower on Kullback-Leibler Divergence (KL) and Information Gain (IG), which suggests that CC, shuffled Area Under the Curve (sAUC) and the shuffled variant of Normalized Scanpath Saliency (sNSS) are more saturated metrics, while KL and IG are more expressive for our analysis.
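For reference, the sketch below implements the common benchmark definitions of the KL and IG metrics mentioned above (prediction and baseline maps treated as probability distributions, fixations as a binary map). The function names, the `baseline_map` argument and the evaluation details are assumptions; the paper's exact protocol may differ.

```python
# Sketch of KL divergence and Information Gain as used in saliency evaluation.
import numpy as np

def kl_divergence(saliency_map, fixation_map, eps=1e-8):
    """KL between the ground-truth fixation distribution and the predicted map."""
    p = fixation_map / (fixation_map.sum() + eps)   # ground truth as a distribution
    q = saliency_map / (saliency_map.sum() + eps)   # prediction as a distribution
    return float(np.sum(p * np.log(eps + p / (q + eps))))

def information_gain(saliency_map, fixation_points, baseline_map, eps=1e-8):
    """Average log-likelihood gain (in bits) over a baseline at fixated pixels.

    fixation_points is a binary map of discrete fixation locations;
    baseline_map is an assumed reference (e.g. a center-bias prior).
    """
    q = saliency_map / (saliency_map.sum() + eps)
    b = baseline_map / (baseline_map.sum() + eps)
    fix = fixation_points.astype(bool)
    return float(np.mean(np.log2(q[fix] + eps) - np.log2(b[fix] + eps)))
```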
Summary
The way contemporary Computer Vision systems represent our world seems progressively further from being understood by humans. Both the performance and the complexity of feature learning methods increase together; this complexity derives from the application of Deep Learning (DL) and Convolutional Neural Networks (CNNs) to compelling but challenging vision tasks such as object recognition [1] and tracking [2], or anomaly detection in video surveillance scenarios [3]. Noteworthy is the role of eye movements in visual