Abstract

This paper addresses the problem of analyzing social interactions between humans in videos. We focus on recognizing dyadic human interactions from multi-modal data, specifically depth, color, and skeleton sequences. First, we introduce a new person-centric proxemic descriptor, named PROF, extracted from skeleton data, which incorporates intrinsic and extrinsic distances between two interacting persons in a view-variant scheme. Then, a novel key frame selection approach is introduced to identify salient instants of the interaction sequence based on joint energy. From RGBD videos, more holistic CNN features are extracted by applying adapted pre-trained CNNs to optical flow frames. Features from the three modalities are combined and then classified using a linear SVM. Finally, extensive experiments carried out on two multi-modal, multi-view interaction datasets demonstrate the robustness of the proposed approach compared to state-of-the-art methods.
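As a rough illustration of the joint-energy key frame selection step, the sketch below computes a per-frame motion energy from a skeleton sequence and keeps the highest-energy frames. The energy definition (summed squared frame-to-frame joint displacement), the function names, and the top-k selection are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def joint_energy(skeleton_seq):
    """Per-frame motion energy of a skeleton sequence.

    skeleton_seq: array of shape (T, J, 3) -- T frames, J joints, 3D coords.
    The energy of frame t is taken here as the summed squared displacement
    of all joints between frames t-1 and t (an assumed definition).
    """
    diffs = np.diff(skeleton_seq, axis=0)       # (T-1, J, 3) displacements
    energy = (diffs ** 2).sum(axis=(1, 2))      # (T-1,) scalar per frame
    return np.concatenate([[0.0], energy])      # pad frame 0 with zero energy

def select_key_frames(skeleton_seq, k=8):
    """Return the indices of the k frames with the highest joint energy,
    sorted back into temporal order."""
    energy = joint_energy(skeleton_seq)
    return np.sort(np.argsort(energy)[-k:])
```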

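The fusion and classification stage could likewise be sketched as follows, assuming each modality has already been reduced to a fixed-length descriptor vector. Concatenation-based fusion, the feature dimensions, and the toy data are assumptions; the abstract only states that features from the three modalities are combined and classified with a linear SVM.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def fuse(prof_feats, cnn_flow_feats, depth_feats):
    # Simple concatenation fusion of the three per-modality descriptors
    # (one fixed-length vector per sample and modality).
    return np.concatenate([prof_feats, cnn_flow_feats, depth_feats], axis=1)

# Hypothetical toy data standing in for the extracted descriptors:
# 100 samples, with assumed per-modality dimensions of 64, 128, and 32.
rng = np.random.default_rng(0)
X = fuse(rng.normal(size=(100, 64)),
         rng.normal(size=(100, 128)),
         rng.normal(size=(100, 32)))
y = rng.integers(0, 5, size=100)   # 5 hypothetical interaction classes

# Linear SVM on the fused features, with standardization for scale balance.
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X, y)
print(clf.score(X, y))
```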