Abstract

Light Field Salient Object Detection (LF SOD) aims to segment visually distinctive objects from their surroundings. Since light field images provide a multi-focus stack (many focal slices at different depth levels) and an all-focus image of the same scene, they record comprehensive but redundant information. Existing methods exploit this cue with attention-augmented long short-term memory, 3D convolution, or graph learning. However, the relative importance of intra-slice and inter-slice information in the focal stack has not been well investigated. In this paper, we propose a learnable weight descriptor that simultaneously exploits different weights along the slice, spatial, and channel dimensions, and build an LF SOD method upon this descriptor. The method extracts slice features and all-focus features from a weight-shared backbone and a separate backbone, respectively. A transformer decoder learns the weight descriptor, which both emphasizes the importance of each slice (inter-slice) and discriminates the spatial and channel importance within each slice (intra-slice). The learned descriptor serves as a weight that makes the slice features attend to important slices, regions, and channels. Furthermore, we propose a hierarchical multi-modal fusion that aggregates high-layer features by modelling long-range dependencies to fully excavate common salient semantics, and combines low-layer features under a spatial constraint to eliminate the blurring effect of slice features. Experimental results show that our method outperforms state-of-the-art methods by at least 25% in terms of the mean absolute error metric, demonstrating a significant improvement in LF SOD performance from the designed learnable weight descriptor. Code: https://github.com/liuzywen/LFTransNet.
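
The following PyTorch code is a minimal sketch of the learnable weight descriptor idea described above, not the authors' released implementation (see the repository linked in the abstract for that). It shows how a transformer decoder can turn a set of learnable query tokens into a descriptor that re-weights focal-slice features along the slice, channel, and spatial dimensions. All tensor shapes, layer sizes, the pooling scheme, and the sigmoid gating are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SliceWeightDescriptor(nn.Module):
        def __init__(self, channels: int = 64, num_slices: int = 12, heads: int = 4):
            super().__init__()
            # One learnable query token per focal slice (assumption).
            self.query = nn.Parameter(torch.randn(num_slices, channels))
            layer = nn.TransformerDecoderLayer(d_model=channels, nhead=heads,
                                               batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=2)
            self.slice_head = nn.Linear(channels, 1)           # inter-slice weight
            self.channel_head = nn.Linear(channels, channels)  # intra-slice channel weight

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (B, N, C, H, W) slice features from the weight-shared backbone.
            b, n, c, h, w = feats.shape
            # Memory for the decoder: spatially pooled slice tokens, (B, N, C).
            memory = feats.flatten(3).mean(dim=3)
            queries = self.query.unsqueeze(0).expand(b, -1, -1)  # (B, N, C)
            desc = self.decoder(queries, memory)                 # learned descriptor, (B, N, C)
            slice_w = torch.sigmoid(self.slice_head(desc))       # (B, N, 1)
            chan_w = torch.sigmoid(self.channel_head(desc))      # (B, N, C)
            # Spatial weight: similarity between the descriptor and each location.
            spatial = torch.einsum('bnc,bnchw->bnhw', desc, feats)
            spatial_w = torch.sigmoid(spatial).unsqueeze(2)      # (B, N, 1, H, W)
            # Apply slice, channel, and spatial weights to the slice features.
            return (feats * slice_w.view(b, n, 1, 1, 1)
                          * chan_w.view(b, n, c, 1, 1)
                          * spatial_w)

    feats = torch.randn(2, 12, 64, 16, 16)
    out = SliceWeightDescriptor()(feats)
    print(out.shape)  # torch.Size([2, 12, 64, 16, 16])

The sketch only captures the re-weighting step; in the full method the weighted slice features would additionally pass through the hierarchical multi-modal fusion with the all-focus features before prediction.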
