Abstract

We propose a method for segmenting pedestrians in a video in a spatio-temporally consistent way. Given a bounding-box sequence for each pedestrian obtained by a conventional pedestrian detector and tracker, we construct a spatio-temporal graph on the video and segment each pedestrian on the basis of a well-established graph-cut segmentation framework. More specifically, the energy function for the graph-cut segmentation comprises three terms: (1) a data term, (2) a spatial pairwise term, and (3) a temporal pairwise term. To maintain temporal consistency of the segmentation even under relatively large motions, we introduce a transportation-minimization framework that provides temporal correspondences. Moreover, we introduce edge-sticky superpixels to maintain the spatial consistency of object boundaries. In experiments, we demonstrate that the proposed method improves segmentation accuracy indices, such as the average and weighted intersection over union, on the TUD datasets and the PETS2009 dataset at both the instance level and the semantic level.
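For intuition, the three-term energy described above can be written in the usual pairwise Markov random field form. The notation below (superpixel labels x_i, spatial neighborhood N_s, temporal correspondence set N_t, and weights λ_s, λ_t) is our own illustrative choice, not necessarily the paper's exact formulation:

```latex
% Hedged sketch of a three-term graph-cut energy over superpixel labels
% x_i ∈ {0, 1} (0 = background, 1 = pedestrian); the neighborhood sets
% and weights are assumed notation, not the paper's definitions.
E(\mathbf{x}) =
    \underbrace{\sum_{i} \phi_i(x_i)}_{\text{data term}}
  + \lambda_s \underbrace{\sum_{(i,j) \in \mathcal{N}_s} \psi^{s}_{ij}(x_i, x_j)}_{\text{spatial pairwise term}}
  + \lambda_t \underbrace{\sum_{(i,k) \in \mathcal{N}_t} \psi^{t}_{ik}(x_i, x_k)}_{\text{temporal pairwise term}}
```

Here N_s links neighboring superpixels within a frame, while N_t links temporally corresponding superpixels across frames (in the proposed method, these correspondences come from the transportation-minimization step). When the pairwise potentials are submodular, a global minimizer of such an energy can be found exactly with a single min-cut/max-flow computation.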

Highlights

  • Silhouette extraction or human body segmentation is widely conducted as the first step in many high-level computer vision tasks of video surveillance systems, such as human tracking [1–4], human action recognition [5–8], and gait-based identification and recognition [9–11]

  • We demonstrate that the proposed method improves the performance of pedestrian silhouette extraction at both the instance level and semantic level on public datasets compared with state-of-the-art methods

  • The instance-level experimental results are presented in Table 2, while examples of visualized mask-type and edge-type results are shown in Fig. 11 and Fig. 12, respectively


Summary

Introduction

Silhouette extraction or human body segmentation is widely conducted as the first step in many high-level computer vision tasks of video surveillance systems, such as human tracking [1–4], human action recognition [5–8], and gait-based identification and recognition [9–11]. A typical approach of supervised pedestrian silhouette extraction [12, 13] is to manually annotate the target's mask in the first frame and propagate the mask to the other frames, so that such segmentation methods can be extended to pedestrian silhouette extraction in video. Because the mask annotation imposes a manual burden, however, it is difficult to apply supervised methods to pedestrian silhouette extraction in an automatic surveillance system. State-of-the-art superpixel segmentation methods (e.g., the SLIC superpixel [27] and superpixels extracted via energy-driven sampling (SEEDS) [29]) provide a balance between appearance and shape regularity, and usually perform well in computer vision tasks. This balance, however, does not always guarantee that object boundaries are well preserved. Because our ultimate target is to extract pedestrians' silhouettes, we need to adopt a superpixel segmentation method that better preserves object boundaries.
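As a concrete illustration of the superpixel step (not the paper's implementation; the edge-sticky superpixel is the authors' own method), scikit-image's SLIC can be used to over-segment a frame, after which each superpixel would receive a foreground/background label in the graph-cut stage:

```python
# Hedged sketch: over-segment a video frame with SLIC superpixels using
# scikit-image. This is a stand-in for the paper's edge-sticky superpixels,
# shown only to illustrate superpixel-wise labeling on a frame.
import numpy as np
from skimage import data
from skimage.segmentation import slic, mark_boundaries

frame = data.astronaut()  # placeholder for a video frame

# `compactness` trades off appearance fidelity against shape regularity --
# exactly the balance the introduction notes may sacrifice object boundaries.
labels = slic(frame, n_segments=400, compactness=10.0, start_label=0)
print(f"{labels.max() + 1} superpixels")  # one graph node per superpixel

# Per-superpixel mean color: a typical feature for a data term.
mean_colors = np.stack(
    [frame[labels == s].mean(axis=0) for s in range(labels.max() + 1)]
)

overlay = mark_boundaries(frame, labels)  # visualize superpixel boundaries
```

A lower compactness hugs image edges more tightly at the cost of irregular superpixel shapes; that trade-off is what motivates a boundary-preserving alternative such as the edge-sticky superpixel.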

Proposed method
Superpixel-wise labeling
Spatial pairwise term
Experimental setting
Temporal pairwise term
Conclusion
