Pedestrian attribute recognition (PAR) aims to predict the visual attributes of a pedestrian image. PAR has been used as a source of soft biometrics for visual surveillance and IoT security. Most current PAR methods operate on single, static images. However, image-based methods struggle to handle occlusion and action-related attributes in real-world applications. Recently, video-based PAR has attracted much attention because it can exploit the temporal cues in video sequences. Unfortunately, existing methods usually ignore both the correlations among different attributes and the relations between attributes and spatial regions. To address this problem, we propose a novel method for video-based PAR that explores the relationships among different attributes in both the spatial and temporal domains. More specifically, a spatio-temporal saliency module (STSM) is introduced to capture the key visual patterns from video sequences, and a spatio-temporal attribute relationship learning (STARL) module is proposed to mine the correlations among these patterns. Meanwhile, a large-scale benchmark for video-based PAR, RAP-Video, is built by extending the image-based dataset RAP-2; it contains 83,216 tracklets covering 25 scenes. To the best of our knowledge, this is the largest dataset for video-based PAR. Extensive experiments are performed on the proposed benchmark as well as on MARS Attribute and DukeMTMC-Video Attribute, and the superior performance demonstrates the effectiveness of the proposed method.
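The abstract names the two modules but gives no implementation details. The following is a minimal, illustrative PyTorch sketch of how such a pipeline could be wired: the 1x1x1-convolution saliency gate, the learned per-attribute queries, the multi-head attention design, and the attribute count are all assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: module names (STSM, STARL) come from the abstract;
# all internal design choices below are assumptions.
import torch
import torch.nn as nn


class STSM(nn.Module):
    """Spatio-temporal saliency module: gates clip features with a learned
    spatio-temporal attention map (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv3d(dim, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):            # x: (B, C, T, H, W) backbone clip features
        return x * self.attn(x)      # emphasize salient spatio-temporal regions


class STARL(nn.Module):
    """Spatio-temporal attribute relationship learning: learned attribute
    queries attend over region tokens, so attention weights couple attributes
    to regions and to each other (assumed design)."""

    def __init__(self, dim, num_attrs):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_attrs, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):       # tokens: (B, N, dim) region tokens
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        attr_feats, _ = self.attn(q, tokens, tokens)  # attributes attend to regions
        return self.head(attr_feats).squeeze(-1)      # (B, num_attrs) logits


# Usage: a video backbone (not shown) would produce the clip features;
# shapes and the attribute count are illustrative.
B, C, T, H, W = 2, 256, 8, 16, 8
feats = torch.randn(B, C, T, H, W)
salient = STSM(C)(feats)                        # (B, C, T, H, W)
tokens = salient.flatten(2).transpose(1, 2)     # (B, T*H*W, C) region tokens
logits = STARL(C, num_attrs=54)(tokens)         # one logit per attribute
```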