Abstract

We propose a multistream multitask deep network for joint human detection and head pose estimation in RGB-D videos. To achieve high accuracy, we jointly exploit appearance, shape, and motion information as inputs. Based on the depth information, we generate scale-invariant proposals, which are then fed into a novel contextual region of interest pooling (CRP) layer in our deep network. The CRP layer uses two branches to incorporate contextual information for each subject. The proposed method outperforms state-of-the-art approaches on three public datasets.
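The contextual pooling idea described above can be illustrated with a minimal sketch: one branch pools features from the proposal box itself, a second from an enlarged "context" box around it, and the two results are stacked. This is an assumption-laden toy in NumPy, not the authors' CRP layer; the function names, the max-pooling choice, and the context scale factor are all hypothetical.

```python
import numpy as np

def roi_max_pool(fmap, box, out=2):
    # fmap: (C, H, W) feature map; box: (x0, y0, x1, y1) in feature-map coords.
    # Divides the box into an out x out grid and max-pools each cell.
    x0, y0, x1, y1 = box
    region = fmap[:, y0:y1, x0:x1]
    C, h, w = region.shape
    ys = np.array_split(np.arange(h), out)
    xs = np.array_split(np.arange(w), out)
    pooled = np.zeros((C, out, out))
    for i, yi in enumerate(ys):
        for j, xj in enumerate(xs):
            pooled[:, i, j] = region[:, yi][:, :, xj].max(axis=(1, 2))
    return pooled

def contextual_roi_pool(fmap, box, scale=1.5, out=2):
    # Branch 1: the subject box itself. Branch 2: an enlarged context box
    # (scale factor is a hypothetical choice), clipped to the feature map.
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = (x1 - x0) * scale, (y1 - y0) * scale
    H, W = fmap.shape[1:]
    ctx = (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
           min(W, int(cx + w / 2)), min(H, int(cy + h / 2)))
    # Concatenate the two branches along the channel axis.
    return np.concatenate([roi_max_pool(fmap, box, out),
                           roi_max_pool(fmap, ctx, out)], axis=0)

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)
p = contextual_roi_pool(fmap, (2, 2, 6, 6))
print(p.shape)  # (2, 2, 2): subject channels stacked with context channels
```

In a full detector the two pooled tensors would feed separate fully connected branches for the detection and head-pose tasks; here the sketch only shows how the same proposal yields both a tight and a context-enlarged feature.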
