Abstract
In this paper, we propose to learn recurrent 3D attention (A3D) for video-based person re-identification. Attention models play a key role in both the spatial and temporal domains of video representation. Most existing methods apply a spatial attention model to extract features from single images and then aggregate the image features with attentive temporal pooling or an RNN. However, the inherent consistencies and correlations between spatial and temporal cues are not leveraged. Our A3D method exploits the joint constraints of temporal and spatial attention to enhance the robustness of the attention model. Toward this goal, we treat a pedestrian video as a unified 3D bin in which the temporal domain is an additional dimension. We then develop an attention agent that iteratively selects the locations of salient spatial-temporal parts in the 3D bin. In addition, we formulate sequential 3D attention learning as a Markov decision process and train the representation network and the attention detector end-to-end with the policy gradient method. We evaluate the proposed method on three challenging datasets, iLIDS-VID, PRID-2011, and the large-scale MARS dataset, and consistently improve performance in comparison with state-of-the-art methods.
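To make the abstract's training scheme concrete, the following is a minimal, hypothetical sketch of the core idea of policy-gradient attention selection: a softmax policy scores the cells of a 3D video bin, a location is sampled, and a reward drives a REINFORCE update. The toy features, the linear policy, the reward (cell 0 is deemed salient), and all sizes are illustrative assumptions, not the authors' actual A3D architecture.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over cell scores
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
num_cells, feat_dim = 8, 4              # tiny 3D bin, e.g. 2 frames x 2 x 2 cells
cell_feats = rng.normal(size=(num_cells, feat_dim))
w = np.zeros(feat_dim)                  # linear policy parameters (illustrative)

for step in range(100):
    probs = softmax(cell_feats @ w)     # policy over spatial-temporal locations
    idx = rng.choice(num_cells, p=probs)
    # toy stand-in for a re-id matching reward: cell 0 is the "salient" part
    reward = 1.0 if idx == 0 else 0.0
    # REINFORCE: grad of log pi(idx) for a linear-softmax policy
    grad = cell_feats[idx] - probs @ cell_feats
    w += 0.5 * reward * grad            # ascend the expected reward

probs = softmax(cell_feats @ w)
# after training, the policy concentrates probability on the rewarded location
```

In the paper's setting, the reward would instead come from re-identification accuracy, and the policy would be conditioned on a recurrent state so that each glimpse depends on previously attended parts.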