Abstract

Language-based person search retrieves images of a target person from a natural language description, and is a challenging fine-grained cross-modal retrieval task. We propose a novel hybrid attention network for this task with three components. First, a cubic attention mechanism for person images combines cross-layer spatial attention and channel attention. It exploits both important mid-level details and key high-level semantics to obtain a more discriminative fine-grained feature representation of a person image. Second, a text attention network for the language description is built on a bidirectional LSTM (BiLSTM) and a self-attention mechanism. It learns bidirectional semantic dependencies and captures the key words of a sentence, so that the context information and key semantic features of the description are extracted more effectively and accurately. Third, a cross-modal attention mechanism and a joint loss function for cross-modal learning focus on the relevant parts between text and image features. Together they exploit both cross-modal and intra-modal correlations and help address cross-modal heterogeneity. In extensive experiments on the CUHK-PEDES dataset, our approach obtains higher performance than state-of-the-art approaches, demonstrating its advantage.
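The self-attention step of the text branch can be illustrated with a minimal sketch. It is not the authors' implementation: the BiLSTM is omitted, the per-word hidden states are toy stand-ins for its outputs, and the scoring vector `query` and the function names are hypothetical. It only shows how attention weights over words can pool a sentence into a single feature vector that emphasizes key words.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(hidden_states, query):
    """Pool per-word hidden states into one sentence vector.

    hidden_states: list of T vectors (assumed stand-ins for BiLSTM outputs).
    query: hypothetical learned vector used to score each word.
    Returns (pooled_vector, attention_weights).
    """
    # Score each word by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, one dimension at a time.
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return pooled, weights


# Toy example: three "words" with two-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
sentence_vec, word_weights = self_attention_pool(states, query=[1.0, 1.0])
```

In this toy input the third word receives the largest weight because its hidden state aligns most strongly with the query, which is the sense in which self-attention can capture key words.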

Highlights

  • In today’s society, video surveillance has become an important means of public security, and thousands of surveillance cameras have been installed in public places

  • Unlike most other methods, which generate both spatial attention and channel attention from the highest convolution layer or from the same layer of the network [12], we propose a cross-layer cubic attention mechanism that generates spatial attention from the mid-level network and channel attention from the high-level network, so that it fully leverages the spatial information and rich details of the mid-level network and the rich semantics of the high-level network, and thereby performs better on this fine-grained task

  • Unlike other attention methods that generate both SA and CA from the conv5 layer of ResNet50, which is too abstract to provide sufficient detail and spatial information for effective spatial attention, we propose a mechanism that generates spatial attention based on the conv4 layer of ResNet50 and channel attention based on the conv5 layer
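The cross-layer idea in the highlights above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: the learned layers are replaced by plain averaging followed by a sigmoid, "cubic" is assumed to mean a channel × height × width attention tensor formed from the two attentions, and the mid-level spatial map is assumed to have already been resized to the high-level spatial size. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(mid_feat):
    """mid_feat: [C][H][W] nested lists -> [H][W] map.

    Assumption: average over channels, then sigmoid (no learned weights).
    """
    c, h, w = len(mid_feat), len(mid_feat[0]), len(mid_feat[0][0])
    return [
        [sigmoid(sum(mid_feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def channel_attention(high_feat):
    """high_feat: [C][H][W] -> [C] weights.

    Assumption: global average pooling per channel, then sigmoid.
    """
    weights = []
    for channel in high_feat:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        weights.append(sigmoid(total / count))
    return weights


def cubic_attention(spatial_map, channel_weights):
    """Combine [H][W] and [C] attentions into a [C][H][W] tensor (outer product)."""
    return [[[cw * s for s in row] for row in spatial_map] for cw in channel_weights]


def apply_attention(feat, cube):
    """Element-wise reweighting of a [C][H][W] feature map."""
    return [
        [[f * a for f, a in zip(f_row, a_row)] for f_row, a_row in zip(f_ch, a_ch)]
        for f_ch, a_ch in zip(feat, cube)
    ]


# Toy features: a 2-channel "mid-level" map and a 3-channel "high-level" map,
# both 2x2 (the spatial sizes are assumed to be aligned already).
mid = [[[1.0, -1.0], [0.5, 2.0]], [[0.0, 1.0], [-0.5, 1.0]]]
high = [[[2.0, 2.0], [2.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]], [[-1.0, 1.0], [1.0, -1.0]]]

cube = cubic_attention(spatial_attention(mid), channel_attention(high))
attended = apply_attention(high, cube)
```

The design point is that the spatial map comes from the more detailed mid-level features while the channel weights come from the more semantic high-level features. In a real network both would be produced by learned layers rather than fixed averages.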


Summary

Introduction

In today’s society, video surveillance has become an important means of public security, and thousands of surveillance cameras have been installed in public places. Automatically searching for persons of interest in large-scale video or image databases has attracted increasing attention from researchers. Person re-identification (reID) has a major limitation: it requires at least one image of the target person, and in some real cases no such image can be obtained. In these cases, the target person can only be searched for in the surveillance video/image database using a textual description of the person’s appearance provided by a witness. This task is called text-based person search (TBPS).
