Abstract
Video-based person re-identification (Re-ID) remains a promising but challenging computer vision task, owing to the lack of discriminative features that aggregate both spatial and temporal information. In this paper, we propose a joint attentive spatial-temporal feature aggregation network (JAFN) for video-based person Re-ID, which simultaneously learns a quality- and frame-aware model to obtain attention-based spatial-temporal feature aggregation. Specifically, we utilize a CNN to learn spatial features, while introducing an LSTM to separately learn temporal features. For feature aggregation, we introduce two attention mechanisms that generate quality and frame significance scores, respectively: the quality score measures the quality of each image for attentive spatial feature aggregation, and the frame score measures how much each frame contributes to the temporal feature. We then apply set-pooling to aggregate both the quality-aware spatial features and the frame-aware temporal features based on these attentive scores. Residual learning is also introduced between the LSTM and the CNN for adaptive spatial-temporal feature fusion. Furthermore, we adopt data balancing to alleviate the class disproportions present in video-based Re-ID datasets. Extensive experiments on the PRID2011, i-LIDS-VID, and MARS datasets demonstrate the effectiveness of the proposed JAFN. Moreover, comparisons across different modules and features within JAFN show that our approach generalizes favorably in attentively aggregating both spatial and temporal features.
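To make the attention-weighted set-pooling idea concrete, below is a minimal sketch, assuming PyTorch. It shows how a learned significance score per frame can weight a set of per-frame features (e.g., CNN spatial features or LSTM temporal features) into a single aggregated feature. The module and variable names (AttentiveSetPooling, feat_dim, frame_feats) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of attention-weighted set pooling (assumed PyTorch).
# Names here are hypothetical; this is not the paper's official code.
import torch
import torch.nn as nn

class AttentiveSetPooling(nn.Module):
    """Aggregates a set of per-frame features into one sequence-level
    feature, weighting each frame by a learned significance score."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Small scoring head mapping each frame feature to a scalar score.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        scores = self.score(frame_feats)          # (B, T, 1)
        weights = torch.softmax(scores, dim=1)    # normalize over frames
        # Weighted sum over the frame (set) dimension.
        return (weights * frame_feats).sum(dim=1)  # (B, feat_dim)

# Usage: pool per-frame CNN (spatial) or LSTM (temporal) outputs.
pool = AttentiveSetPooling(feat_dim=128)
feats = torch.randn(4, 8, 128)   # 4 tracklets, 8 frames, 128-d features
video_feat = pool(feats)         # (4, 128)
```

The same pooling structure can serve both branches: fed with CNN outputs and quality scores it yields the quality-aware spatial feature, and fed with LSTM outputs and frame scores it yields the frame-aware temporal feature.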