Jointly localizing, identifying, and extracting humans with consistent appearance from a personal photo stream is an important problem with wide applications. Strong variations in foreground and background, together with the irregular occurrence of human subjects across images, make this realistic problem challenging. Inspired by advances in object detection, scene understanding, and image cosegmentation, we explore explicit constraints to label and segment human objects rather than nonhuman objects and “stuff.” We refer to this problem as multiple human identification and cosegmentation (MHIC). To identify specific human subjects, we propose an efficient human instance detector that combines an extended color line model with a poselet-based human detector. Moreover, to capture high-level human shape information, we propose a novel soft shape cue: it is initialized by the human detector, enhanced through a generalized geodesic distance transform, and finally refined with a joint bilateral filter. We also capture the rich feature context around each pixel with an adaptive cross-region data structure, which gives higher discriminative power than a single pixel-based estimation. The high-level object and shape cues are then integrated with low-level pixel cues and midlevel contour cues in a principled conditional random field (CRF) framework, which can be solved efficiently with fast graph cut algorithms. We evaluate our method on the newly created NTU-MHIC human dataset, which contains 351 images with manually annotated ground-truth segmentation. Both visual and quantitative results demonstrate that our method achieves state-of-the-art performance on the MHIC task.
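The generalized geodesic distance transform mentioned above propagates a soft seed cost through the image, so that distance grows both with spatial steps and with intensity changes. As a minimal illustration (not the paper's implementation; the parameter names `nu`, `gamma`, and the raster-scan approximation are assumptions), the transform can be sketched with two-pass raster sweeps over a grayscale image and a soft probability mask:

```python
import numpy as np

def generalized_geodesic_distance(prob, image, nu=10.0, gamma=1.0, n_iters=2):
    """Approximate generalized geodesic distance transform (sketch).

    Seeds are soft: each pixel starts at cost nu * (1 - prob), so pixels the
    detector is confident about act as near-zero-cost seeds. Distances grow
    with spatial step length and with image-intensity differences scaled by
    gamma. Forward/backward raster sweeps approximate the exact transform.
    """
    h, w = prob.shape
    d = nu * (1.0 - prob)  # soft seed cost from the detector's probability map
    # causal neighbors for the forward sweep: (dy, dx, Euclidean step length)
    fwd = [(-1, -1, np.sqrt(2)), (-1, 0, 1.0), (-1, 1, np.sqrt(2)), (0, -1, 1.0)]
    bwd = [(-dy, -dx, c) for dy, dx, c in fwd]  # mirrored for the backward sweep
    for _ in range(n_iters):
        for neighbors, rows, cols in (
            (fwd, range(h), range(w)),
            (bwd, range(h - 1, -1, -1), range(w - 1, -1, -1)),
        ):
            for y in rows:
                for x in cols:
                    for dy, dx, step in neighbors:
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            grad = image[y, x] - image[ny, nx]
                            cost = np.sqrt(step ** 2 + (gamma * grad) ** 2)
                            if d[ny, nx] + cost < d[y, x]:
                                d[y, x] = d[ny, nx] + cost
    return d
```

With a flat image and a single confident seed pixel, the result reduces to an ordinary chamfer-style distance from the seed; adding image gradients makes the distance respect object boundaries, which is what lets the soft shape cue hug the human silhouette.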