Abstract
Human visual attention modelling is a persistent interdisciplinary research challenge that has gained renewed interest in recent years, mainly due to the latest developments in deep learning. This is particularly evident in saliency benchmarks. Novel deep learning-based visual saliency models show promising results in capturing high-level (top-down) human visual attention processes, and in this they strongly differ from earlier approaches, which were mainly characterised by low-level (bottom-up) visual features. Together, these developments reflect innate human selectivity mechanisms, which rely on both high- and low-level factors that interact with each other. Motivated by the importance of these interactions, in this project we tackle visual saliency modelling holistically, examining whether both the high- and low-level features that govern human attention can be considered jointly. Specifically, we propose SAtSal (Self-Attention Saliency), a novel method that leverages both high- and low-level features through multilevel merging of skip connections during the decoding stage. To properly integrate the valuable signals from the multilevel spatial features, we incorporate convolutional self-attention modules on the skip connections from the encoder to the decoder network. The self-attention modules thus learn to separate the latent representation of the salient regions from irrelevant information, in an embedded manner and jointly with the main encoder-decoder backbone. Finally, we evaluate SAtSal against various existing solutions on the well-known standard saliency benchmark MIT300 to validate our approach. To further examine SAtSal's robustness on other image types, we also evaluate it on the Le-Meur saliency painting benchmark.
Highlights
Visual attention consists of perceptual and cognitive mechanisms that empower humans to rapidly select and interpret the most interesting parts of a complex visual scene.
Each ECSA module is placed just before its corresponding decoder block and takes as input both Z_i and H_(5-i), transforming the intermediate features H_(5-i) into three variables Q, K and V. Unlike [75], which incorporates just one 1×1 convolution layer without an activation function, we extend our implementation with a shallow series of activated Convolutional Neural Network (CNN) layers interspersed with down-sampling and up-sampling operations. The ECSA module architecture differs slightly according to the i-th positional block: since an attention vector is extracted at an encoder-layer position and injected into the decoder, the two vector spaces are not alike, so a deeper transformation needs to be applied, while keeping the computational efficiency of the whole architecture in mind.
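For illustration, the following is a minimal PyTorch sketch of such a convolutional self-attention block on a skip connection. It is not the paper's exact implementation: the class name ECSABlock, the channel reduction factor, the pooling factor, the layer depths, and the choice to fuse the decoder features Z_i with the attended skip features by concatenation are all assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ECSABlock(nn.Module):
    """Illustrative convolutional self-attention block for one skip connection.

    The skip features H (from the encoder) are projected into Q, K and V by
    shallow stacks of activated convolutions; down-sampling keeps the attention
    map affordable and the result is up-sampled back before being fused with
    the decoder features Z. Depths and channel sizes are assumptions.
    """

    def __init__(self, skip_ch, dec_ch, reduction=2, pool=2):
        super().__init__()
        inner = max(skip_ch // reduction, 8)

        def proj(out_ch):
            # shallow series of activated convolutions with down-sampling
            return nn.Sequential(
                nn.Conv2d(skip_ch, inner, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(pool),
                nn.Conv2d(inner, out_ch, kernel_size=1),
                nn.ReLU(inplace=True),
            )

        self.to_q = proj(inner)
        self.to_k = proj(inner)
        self.to_v = proj(skip_ch)
        self.pool = pool
        # fusion of decoder features Z_i with attended skip features (assumed)
        self.fuse = nn.Conv2d(skip_ch + dec_ch, dec_ch, kernel_size=3, padding=1)

    def forward(self, h_skip, z_dec):
        b, c, hgt, wid = h_skip.shape
        q = self.to_q(h_skip).flatten(2).transpose(1, 2)   # B x N x C'
        k = self.to_k(h_skip).flatten(2)                    # B x C' x N
        v = self.to_v(h_skip).flatten(2).transpose(1, 2)    # B x N x C

        # spatial self-attention over the down-sampled skip features
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # B x N x N
        out = (attn @ v).transpose(1, 2).reshape(
            b, c, hgt // self.pool, wid // self.pool)

        # up-sample back to the skip resolution and keep a residual path
        out = F.interpolate(out, size=(hgt, wid), mode="bilinear",
                            align_corners=False) + h_skip

        # bring the decoder features to the same resolution, then fuse
        if z_dec.shape[-2:] != (hgt, wid):
            z_dec = F.interpolate(z_dec, size=(hgt, wid), mode="bilinear",
                                  align_corners=False)
        return self.fuse(torch.cat([out, z_dec], dim=1))
```

In a full model, one such block would sit on each skip connection feeding the decoder, with the i-th block sized to match its encoder stage.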
The Kullback-Leibler Divergence (KLD) evaluates the dissimilarity between the predicted saliency distribution and the ground-truth distribution, Binary Cross-Entropy (BCE) is used as an independent binary classification loss on each CNN output vector, and the Normalized Scanpath Saliency (NSS) [67] provides a saliency metric that measures the mean saliency value at the ground-truth fixation locations
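The snippet below sketches how these three terms could be combined into a single training objective in PyTorch. The function names, the loss weights, and the convention of subtracting NSS (since it is a score to be maximised) are illustrative assumptions, and the predicted map is assumed to already lie in [0, 1] (e.g. after a sigmoid).

```python
import torch
import torch.nn.functional as F


def kld_loss(pred, gt, eps=1e-7):
    # KL divergence between the predicted and ground-truth saliency maps,
    # each normalised to a probability distribution per image.
    pred = pred / (pred.sum(dim=(-2, -1), keepdim=True) + eps)
    gt = gt / (gt.sum(dim=(-2, -1), keepdim=True) + eps)
    return (gt * torch.log(eps + gt / (pred + eps))).sum(dim=(-2, -1)).mean()


def nss_score(pred, fixations, eps=1e-7):
    # Mean value of the standardised (zero-mean, unit-variance) predicted
    # map at the binary ground-truth fixation locations.
    mean = pred.mean(dim=(-2, -1), keepdim=True)
    std = pred.std(dim=(-2, -1), keepdim=True)
    norm = (pred - mean) / (std + eps)
    return (norm * fixations).sum(dim=(-2, -1)) / (fixations.sum(dim=(-2, -1)) + eps)


def combined_loss(pred, gt_map, fixations, w_kld=1.0, w_bce=1.0, w_nss=1.0):
    # Weighted sum of the three terms; NSS is a score to maximise, hence the
    # negative sign. The weights are placeholders, not the paper's values.
    bce = F.binary_cross_entropy(pred, gt_map)
    return (w_kld * kld_loss(pred, gt_map)
            + w_bce * bce
            - w_nss * nss_score(pred, fixations).mean())
```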
Summary
Visual attention consists of perceptual and cognitive mechanisms that empower humans to rapidly select and interpret the most interesting parts of a complex visual scene. Selective mechanisms associated with attention work as a "data preprocessing bottleneck". Cognitively-driven top-down mechanisms are important in the way humans direct their attention to selected elements, whether they are visual, auditory, olfactory or otherwise [20]. In addition to the top-down vs. bottom-up dichotomy, the visual attention literature distinguishes overt from covert attention. Overt attention relies on eye movements shifting from one location of a given visual scene to another; foveal processing enables capturing high levels of detail from objects of interest while suppressing the information from the surrounding periphery.