Abstract

Video-based person re-identification (Re-ID) aims to retrieve a target person from video sequences captured by a distributed camera system. It remains a challenging task due to factors such as occlusion and misalignment in the video. To address these problems, many methods have been proposed to exploit multi-scale spatio-temporal features in videos. However, established methods typically assign equal weights to temporal or spatial features at different scales, which diminishes the distinct role of each feature. In this paper, we propose a novel Multi-scale Feature Aggregation Network (MFANet) for video-based person Re-ID. Specifically, we propose two flexible modules: Multi-scale Temporal Feature Aggregation (MTFA) and Multi-scale Spatial Feature Aggregation (MSFA). These modules first extract temporal (dynamic and static) and spatial (coarse and fine) features at different scales, and then adaptively weight each feature according to the input video sequence. Both lightweight modules can be incorporated into a 3D Convolutional Neural Network to build our MFANet. Extensive experiments on four public benchmarks demonstrate that MTFA and MSFA improve the performance of baseline architectures, and that MFANet outperforms state-of-the-art methods.
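To make the adaptive weighting idea concrete, below is a minimal PyTorch sketch of input-dependent aggregation over multi-scale features. It is an illustration of the general mechanism only: the module name AdaptiveScaleAggregation, the pooling-plus-linear weight predictor, and all tensor shapes are assumptions for this sketch, not the published design of MTFA or MSFA.

    # Minimal sketch of adaptive multi-scale feature aggregation (PyTorch).
    # All names, branch designs, and shapes here are illustrative assumptions,
    # not the paper's actual MTFA/MSFA implementation.
    import torch
    import torch.nn as nn

    class AdaptiveScaleAggregation(nn.Module):
        """Fuses K same-shaped feature maps with weights predicted per input."""
        def __init__(self, channels: int, num_scales: int):
            super().__init__()
            # Predict one weight per scale from globally pooled statistics.
            self.fc = nn.Linear(channels * num_scales, num_scales)

        def forward(self, feats):
            # feats: list of K tensors, each of shape (N, C, T, H, W)
            pooled = [f.mean(dim=(2, 3, 4)) for f in feats]            # K x (N, C)
            weights = torch.softmax(self.fc(torch.cat(pooled, 1)), 1)  # (N, K)
            stacked = torch.stack(feats, dim=1)                        # (N, K, C, T, H, W)
            w = weights.view(*weights.shape, 1, 1, 1, 1)               # broadcastable
            return (w * stacked).sum(dim=1)                            # (N, C, T, H, W)

    # Example: fuse hypothetical "dynamic" and "static" temporal branch outputs.
    agg = AdaptiveScaleAggregation(channels=256, num_scales=2)
    dynamic = torch.randn(4, 256, 8, 16, 8)
    static = torch.randn(4, 256, 8, 16, 8)
    fused = agg([dynamic, static])  # (4, 256, 8, 16, 8)

The softmax over scale weights makes the fusion input-dependent, so each video sequence can emphasize, say, its dynamic branch over its static one, rather than receiving a fixed equal-weight average.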
