Abstract

Recent years have witnessed increasing attention to human action recognition (HAR). Traditional methods seek the optimal spatiotemporal feature representation of human actions in video clips in order to achieve high recognition performance. However, optical limitations such as inappropriate viewpoints, dim illumination, and object occlusion often degrade video quality and substantially hurt recognition performance. Since wireless signals are robust against these optical limitations, we incorporate WiFi signals alongside video streams for HAR. Specifically, we use WiFi Channel State Information (CSI) as a compensator for video streams. A key challenge is how to effectively fuse the video and WiFi information to achieve better prediction performance. To this end, we employ convolutional neural networks and statistical analysis algorithms to extract video and WiFi features, respectively, and propose a novel multi-modal learning approach for video-WiFi feature fusion, in which the video and WiFi features are projected into a common space by supervised learning. Experimental results indicate that the recognition precision of human actions in videos improves markedly with the aid of WiFi signals, and that the proposed multi-modal learning approach rivals state-of-the-art methods.
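The abstract gives no implementation details, but the core fusion idea (projecting CNN video features and statistical WiFi CSI features into a common space and classifying there under supervision) can be sketched compactly. The PyTorch snippet below is a minimal illustration under stated assumptions: the class VideoWiFiFusion, all dimensions (video_dim, wifi_dim, common_dim, num_classes), and the simple additive fusion are hypothetical placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VideoWiFiFusion(nn.Module):
    """Sketch of two-branch fusion: project both modalities into a
    common space, fuse, and classify. All sizes are assumptions."""

    def __init__(self, video_dim=2048, wifi_dim=90,
                 common_dim=256, num_classes=10):
        super().__init__()
        # Project pooled CNN video features into the shared space
        self.video_proj = nn.Linear(video_dim, common_dim)
        # Project statistical WiFi CSI features into the same space
        self.wifi_proj = nn.Linear(wifi_dim, common_dim)
        # Supervised classifier on the fused representation
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, video_feat, wifi_feat):
        v = torch.relu(self.video_proj(video_feat))
        w = torch.relu(self.wifi_proj(wifi_feat))
        fused = v + w  # additive fusion in the common space (one simple choice)
        return self.classifier(fused)

# Usage with random tensors standing in for extracted features
model = VideoWiFiFusion()
video_feat = torch.randn(4, 2048)  # e.g., pooled CNN features per clip
wifi_feat = torch.randn(4, 90)     # e.g., CSI statistics per clip
logits = model(video_feat, wifi_feat)
print(logits.shape)  # torch.Size([4, 10])
```

Training such a model end-to-end with a cross-entropy loss on action labels is one way to realize the "projected to a common space by supervised learning" step the abstract describes; the paper may use a different projection or fusion operator.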
