Abstract

Person re-identification (reID) is a crucial component of intelligent surveillance systems, enabling the recognition of individuals across non-overlapping camera views. Compared to the RGB modality alone, RGB-D based reID can achieve robust, high performance by leveraging the rich complementary features of both modalities, making it applicable in various occluded scenarios. However, current multimodal reID approaches often rely on late fusion or feature-level fusion to combine modalities, which limits their ability to exploit complementary visual and depth-related semantic information and to capture complex interactions between unimodal features. To address these limitations, this paper introduces a cross-modality online transformer (CMOT) for RGB-D based person reID with online learning capabilities, which effectively uses both RGB and depth modalities to extract spatio-temporal features and fuse them across modalities. CMOT comprises three main components: (1) a hypothesis generation module based on a person detector and tracker; (2) dual-stream feature extractors built on convolutional neural networks (CNNs); and (3) a fusion transformer consisting of a self-attention-driven self-attentive modality refinement module (SAMR) and a cross-attention-driven cross-attentive modality interaction module (CAMI), which refine and fuse the complementary RGB-D features extracted by the dual-stream (RGB and depth) CNNs. Additionally, we introduce a bottleneck enhancement feed-forward block within SAMR and CAMI to strengthen the model's representational capacity while significantly reducing parameters and computation compared to the traditional feed-forward network. Moreover, we design a triplet loss function with distance-measuring ability to incorporate online learning, so that CMOT ultimately operates as a few-shot network for reID. Experimental results on three RGB-D person re-identification datasets, namely BIWI RGBD-ID, RobotPKU RGBD-ID, and TVPR2, demonstrate the effectiveness and robustness of CMOT.
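For concreteness, the following PyTorch sketch illustrates the fusion pattern the abstract describes: self-attention refining one modality's tokens (SAMR-style), cross-attention letting RGB tokens attend to depth tokens (CAMI-style), and a bottleneck feed-forward block that projects down rather than expanding, which is what saves parameters relative to a standard transformer FFN. The class names, dimensions, and bottleneck ratio are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of self-attentive refinement + cross-attentive fusion.
# Names, dims, and the bottleneck ratio are assumptions for illustration.
import torch
import torch.nn as nn


class BottleneckFFN(nn.Module):
    """Feed-forward block that shrinks before restoring the dimension,
    reducing parameters versus the usual 4x-expansion FFN."""

    def __init__(self, dim: int, bottleneck_ratio: int = 4):
        super().__init__()
        hidden = dim // bottleneck_ratio  # shrink instead of expand
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual connection


class CrossModalityBlock(nn.Module):
    """One fusion step: self-attention refines the RGB tokens (SAMR-style),
    then cross-attention fuses in depth information (CAMI-style)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = BottleneckFFN(dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Refine RGB tokens with self-attention.
        r = self.norm1(rgb)
        rgb = rgb + self.self_attn(r, r, r, need_weights=False)[0]
        # Fuse: RGB queries attend to depth keys/values.
        q, kv = self.norm2(rgb), self.norm2(depth)
        rgb = rgb + self.cross_attn(q, kv, kv, need_weights=False)[0]
        return self.ffn(rgb)


# Usage with dummy CNN token sequences: batch of 2, 49 spatial tokens, dim 256.
block = CrossModalityBlock()
fused = block(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
print(fused.shape)  # torch.Size([2, 49, 256])
```

A symmetric block with depth as the query stream would complete the bidirectional interaction; the sketch shows one direction for brevity.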
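The online, few-shot behavior rests on the distance-based triplet objective: the model learns an embedding space where a distance threshold alone decides identity matches, so new identities can be enrolled without retraining a classifier. Below is a hedged sketch of such a margin-based triplet loss and a distance-threshold decision rule; the margin and threshold values are assumptions, not the paper's reported settings.

```python
# Sketch of a margin-based triplet loss with a distance decision rule.
# Margin and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin: float = 0.3):
    """Pull same-identity embeddings together, push different ones apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()


def is_same_identity(query, gallery, threshold: float = 0.5) -> torch.Tensor:
    """At inference, the learned distance itself decides the match,
    which is what allows few-shot reID of previously unseen people."""
    return F.pairwise_distance(query, gallery) < threshold


# Toy usage with random 256-d embeddings.
a, p, n = (torch.randn(8, 256) for _ in range(3))
print(triplet_loss(a, p, n).item())
print(is_same_identity(a, p))
```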
