Attention Based CNN-ConvLSTM for Pedestrian Attribute Recognition.

Yang Li, Huahu Xu, Minjie Bian, Junsheng Xiao

doi:10.3390/s20030811

Abstract

Pedestrian attribute recognition plays an important role in video surveillance and has become an active topic in computer vision research. The task is very challenging because of variations in viewpoint, illumination and resolution, and because of occlusion. Existing pedestrian attribute recognition methods perform unsatisfactorily because they ignore the correlation between pedestrian attributes and spatial information. To resolve this issue, this paper regards the task as a spatiotemporal, sequential, multi-label image classification problem and proposes an attention-based neural network (CNN-CAtt-ConvLSTM) consisting of a convolutional neural network (CNN), channel attention (CAtt) and convolutional long short-term memory (ConvLSTM). First, the salient and correlated visual features of pedestrian attributes are extracted by a pre-trained CNN and CAtt. Then, ConvLSTM is used to further extract spatial information and correlations from the pedestrian attributes. Finally, pedestrian attributes are predicted with optimized sequences based on attribute image area size and importance. Extensive experiments on two common pedestrian attribute datasets, the PEdesTrian Attribute (PETA) dataset and the Richly Annotated Pedestrian (RAP) dataset, achieve higher performance than other state-of-the-art (SOTA) methods, which demonstrates the superiority and validity of our method.
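To make the channel attention step concrete, the following is a minimal sketch in plain Python. It assumes a squeeze-and-excitation style gate (global average pooling followed by two fully connected layers with ReLU and sigmoid); the abstract does not specify the exact form of CAtt, so this structure, the function name `channel_attention`, the weight shapes and the omission of biases are all illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def channel_attention(feature_map, w1, w2):
    """Reweight the channels of a C x H x W feature map given as nested lists.

    Assumed squeeze-and-excitation form (not taken from the paper):
    global average pool -> FC + ReLU -> FC + sigmoid -> channel-wise scaling.
    w1 has shape (hidden, C) and w2 has shape (C, hidden); biases are omitted.
    """
    # Squeeze: one scalar per channel from global average pooling.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, layer 1: bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, layer 2: one gate per channel, squashed into (0, 1) by a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale every spatial position of each channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


def demo():
    """Run the gate on a small random 4 x 3 x 3 feature map (hypothetical sizes)."""
    rng = random.Random(0)
    channels, height, width, hidden = 4, 3, 3, 2
    feature_map = [
        [[rng.random() for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]
    w1 = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(channels)]
    return feature_map, channel_attention(feature_map, w1, w2)
```

In a full model of this kind, the reweighted feature map would be passed on to the recurrent stage; here the weights are random, so the gates only illustrate the data flow, not learned attribute saliency.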

Highlights

Along with the fast development of video surveillance networks, using computer technology to realize the intelligence of video surveillance systems has become a hot research area.
As pedestrians are one of the important targets of video surveillance, the recognition of pedestrian visual attributes (such as gender, age and clothing style) has become an important task [1].
As an intermediate semantic feature, pedestrian visual attributes are robust to viewpoint changes and observation conditions.
A novel attention-based neural network model (CNN-channel attention (CAtt)-ConvLSTM) is proposed to fully mine the semantic correlation and spatial information of pedestrian attributes and thereby improve the performance of pedestrian attribute recognition.

Summary

Introduction

As pedestrians are one of the important targets of video surveillance, the recognition of pedestrian visual attributes (such as gender, age and clothing style) has become an important task [1]. As an intermediate semantic feature, pedestrian visual attributes are robust to viewpoint changes and observation conditions. They can establish the relationship between low-level visual features and high-level cognition, and assist in many visual tasks such as human face recognition [2], person re-identification [3,4], person retrieval [5,6] and human identification [7]. Pedestrian attribute recognition remains a great challenge. The main difficulties are: (1) poor image quality, low resolution, occlusion, motion blur, etc.; (2) some attributes, such as “glasses”, require local fine-grained information; (3) attribute appearance and spatial location are prone to change, as with different types of “bag”; (4) large-scale datasets are lacking and sample distributions are unbalanced.

Methods

Results

Conclusion