Abstract

Real-time surveillance systems have become a necessity of modern life, ensuring a safe and secure environment in public spaces. Person re-identification (Re-ID)-based surveillance systems are becoming increasingly prevalent and sophisticated, since they require no human intervention and can be reliably deployed over multi-camera networks. However, one of the major challenges in person Re-ID is visual appearance, i.e., the appearance of a person in an image is strongly affected by the camera view. A deep learning model must therefore learn a discriminative set of features in order to re-identify persons across opposing camera viewpoints. To address this challenge, we propose an image/text-retrieval-based person Re-ID method in which both visual and textual features are exploited to carry out re-identification. More precisely, textual descriptions of the images are encoded as text features using GloVe word embeddings followed by a 1D-MAPCNN, and fused with image-level features extracted by a GoogLeNet model. In addition, feature discriminability is enhanced using local importance-based pooling (LIP) layers, in which adaptive significance weights are learned during downsampling. Moreover, the features from the two modalities are refined during training with attention mechanisms, using the Convolutional Block Attention Module (CBAM) and the proposed shared attention neural network. We observe that the LIP layers, together with the combined visual and textual features, play a key role in acquiring discriminative features even when the visual appearance of the same person is strongly affected by camera pose. The proposed method is validated on the CUHK-PEDES dataset and achieves Rank-1 improvements of 15.34% and 24.39% on text-based and image-based retrieval, respectively.
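
To make the dual-branch pipeline concrete, the following is a minimal PyTorch sketch under stated assumptions: the layer sizes and vocabulary are hypothetical, the exact 1D-MAPCNN, LIP, and CBAM configurations are not specified in the abstract, so the text encoder is a plain 1D convolution over GloVe-style embeddings and the fusion is a simple gated stand-in for the proposed shared attention network. It is an illustration of the overall structure, not the authors' implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class TextBranch(nn.Module):
    """GloVe-embedded captions -> 1D conv encoder (stand-in for the 1D-MAPCNN)."""
    def __init__(self, vocab_size=12000, embed_dim=300, feat_dim=512):
        super().__init__()
        # In practice, self.embed.weight would be initialized from
        # pre-trained GloVe vectors; here it is randomly initialized.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # max-pool over the token dimension
        )

    def forward(self, token_ids):          # (B, T) integer token ids
        x = self.embed(token_ids)          # (B, T, 300)
        x = x.transpose(1, 2)              # (B, 300, T) for Conv1d
        return self.conv(x).squeeze(-1)    # (B, 512)

class SharedAttentionFusion(nn.Module):
    """Toy shared attention: gate the concatenated modalities with
    learned sigmoid weights, then project to a fused embedding."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, 2 * feat_dim),
            nn.Sigmoid(),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, img_feat, txt_feat):
        joint = torch.cat([img_feat, txt_feat], dim=1)   # (B, 1024)
        return self.fuse(self.gate(joint) * joint)       # (B, 512)

class ReIDModel(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # GoogLeNet image branch; LIP downsampling and CBAM refinement
        # from the paper are omitted in this sketch.
        backbone = models.googlenet(weights=None, aux_logits=False,
                                    init_weights=True)
        backbone.fc = nn.Linear(1024, feat_dim)
        self.image_branch = backbone
        self.text_branch = TextBranch(feat_dim=feat_dim)
        self.fusion = SharedAttentionFusion(feat_dim)

    def forward(self, images, token_ids):
        img_feat = self.image_branch(images)     # (B, 512)
        txt_feat = self.text_branch(token_ids)   # (B, 512)
        return self.fusion(img_feat, txt_feat)   # fused Re-ID embedding

model = ReIDModel()
feats = model(torch.randn(2, 3, 224, 224),
              torch.randint(0, 12000, (2, 32)))
print(feats.shape)  # torch.Size([2, 512])

The fused embedding would then be compared (e.g., by cosine distance) against gallery embeddings for both text-based and image-based retrieval.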
