Abstract

Pedestrian attribute recognition is an important problem in computer vision and plays a special role in video surveillance. Previous methods for this task are mainly based on multi-label, end-to-end deep neural networks. These methods neglect to use attributes to define local feature regions and suffer from the limitations of bounding-box-based localization. A new framework that jointly performs human semantic parsing and pedestrian attribute recognition is proposed to achieve effective attribute recognition. By extracting human parts via semantic parsing, both semantic and spatial information can be exploited while the background is eliminated. The framework also uses multi-scale features to capture rich details and contextual information through the proposed attribute recognition bidirectional feature pyramid network. Because the baseline network has a significant impact on performance, EfficientNet-B3 is selected from the EfficientNet family, as it provides an appropriate trade-off among the three factors of CNN scaling (depth, width, and resolution). Finally, the proposed framework is evaluated on the PETA, RAP, and PA-100k datasets. Experimental results show that our method achieves superior performance in both mean accuracy and instance-based metrics compared to state-of-the-art results.
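The abstract describes a pipeline of three pieces: an EfficientNet-B3 backbone, a bidirectional feature-pyramid-style fusion of multi-scale features, and parsing-guided pooling that suppresses background before a multi-label attribute head. The sketch below is a minimal, hypothetical illustration of that arrangement, not the authors' code: it assumes torchvision's `efficientnet_b3`, a heavily simplified one-layer BiFPN-like fusion, soft part masks supplied by an external human parsing model, and the tapped stage indices and channel widths chosen here are illustrative guesses rather than values from the paper.

```python
# Hypothetical sketch of the described framework; names, tapped stages,
# and the simplified fusion are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_b3


class SimpleBiFPNLayer(nn.Module):
    """One top-down + bottom-up fusion pass over multi-scale features
    (a simplified stand-in for a bidirectional feature pyramid network)."""

    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        self.convs_td = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels))
        self.convs_bu = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels))

    def forward(self, feats):
        # Top-down pass: propagate coarse contextual information to finer levels.
        td = list(feats)
        for i in range(len(td) - 2, -1, -1):
            up = F.interpolate(td[i + 1], size=td[i].shape[-2:], mode="nearest")
            td[i] = self.convs_td[i](td[i] + up)
        # Bottom-up pass: propagate fine spatial detail back to coarser levels.
        out = list(td)
        for i in range(1, len(out)):
            down = F.adaptive_max_pool2d(out[i - 1], out[i].shape[-2:])
            out[i] = self.convs_bu[i](out[i] + down)
        return out


class ParsingGuidedAttributeNet(nn.Module):
    """EfficientNet-B3 backbone + multi-scale fusion + parsing-masked pooling
    + multi-label attribute head (an illustrative sketch only)."""

    def __init__(self, num_attributes: int, num_parts: int = 5, fpn_channels: int = 128):
        super().__init__()
        backbone = efficientnet_b3(weights=None)  # pretrained weights optional
        self.stages = backbone.features
        # Stages of torchvision's EfficientNet-B3 tapped for the pyramid (assumed choice).
        self.tap_indices = (3, 5, 7, 8)
        # Infer channel counts of the tapped stages with a dummy forward pass.
        with torch.no_grad():
            x, chans = torch.zeros(1, 3, 224, 224), []
            for i, stage in enumerate(self.stages):
                x = stage(x)
                if i in self.tap_indices:
                    chans.append(x.shape[1])
        self.lateral = nn.ModuleList(nn.Conv2d(c, fpn_channels, 1) for c in chans)
        self.fusion = SimpleBiFPNLayer(fpn_channels, len(chans))
        self.classifier = nn.Linear(fpn_channels * len(chans) * num_parts, num_attributes)
        self.num_parts = num_parts

    def forward(self, images, part_masks):
        """images: (B, 3, H, W); part_masks: (B, num_parts, H, W) soft masks
        produced by an external human semantic parsing model (assumed given)."""
        feats, x = [], images
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i in self.tap_indices:
                feats.append(x)
        feats = [lat(f) for lat, f in zip(self.lateral, feats)]
        feats = self.fusion(feats)
        pooled = []
        for f in feats:
            # Resize parsing masks to this scale and pool features per body part,
            # suppressing background as described in the abstract.
            m = F.interpolate(part_masks, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            part_feat = (m.unsqueeze(2) * f.unsqueeze(1)).mean(dim=(-2, -1))  # (B, P, C)
            pooled.append(part_feat.flatten(1))
        logits = self.classifier(torch.cat(pooled, dim=1))
        return logits  # train with BCEWithLogitsLoss for multi-label attributes
```

In this sketch the attribute head would be trained with a multi-label binary cross-entropy loss, which is the standard setup for PETA, RAP, and PA-100k style attribute annotations; the actual loss, part count, and fusion depth used in the paper may differ.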
