Abstract

In this paper, we propose deep metric learning for human action recognition with SlowFast networks. We adopt SlowFast networks to extract slow-changing spatial semantic information about a single target entity together with fast-changing motion information in the temporal domain. Since deep metric learning can learn the differences between human action classes, we use it to learn a mapping from the original video to compact features in the embedding space. The proposed network consists of three main parts: 1) two branches that operate independently at low and high frame rates to extract spatial and temporal features; 2) feature fusion of the two branches; 3) a network trained jointly with a deep metric learning loss and a classification loss. Experimental results on the KTH human action dataset demonstrate that the proposed method runs faster and has a smaller model size than C3D and R3D, while maintaining high accuracy.
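
The three parts listed in the abstract can be illustrated with a minimal, framework-free sketch. The abstract does not give the frame strides, the fusion operator, the metric loss, or the loss weighting, so everything specific here is an assumption made for illustration: strides of 8 and 2, fusion by concatenation, a triplet loss as the metric term, softmax cross-entropy as the classification term, and the hypothetical names `sample_indices`, `fuse`, `triplet_loss`, and `joint_loss`.

```python
import math


def sample_indices(num_frames, stride):
    """Indices of the frames kept when taking every `stride`-th frame."""
    return list(range(0, num_frames, stride))


def fuse(slow_feat, fast_feat):
    """Fuse the two branch features by concatenation (an assumed choice)."""
    return list(slow_feat) + list(fast_feat)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    A triplet loss is assumed here; the abstract only says
    "deep metric learning".
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def joint_loss(logits, label, anchor, positive, negative, weight=1.0, margin=1.0):
    """Classification loss plus a weighted metric loss (weight is assumed)."""
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )


# Part 1: the slow branch sees few frames, the fast branch sees many.
slow_idx = sample_indices(32, 8)   # low frame rate (assumed stride)
fast_idx = sample_indices(32, 2)   # high frame rate (assumed stride)

# Part 2: fuse toy per-branch features into one embedding.
embedding = fuse([0.1, 0.2], [0.3])

# Part 3: joint objective on toy logits and toy embeddings.
loss = joint_loss([0.0, 0.0], 0, [0, 0], [1, 0], [1, 0], weight=0.5)
```

In a real model the branch features would come from 3D convolutional pathways and the triplets would be mined from a training batch; the sketch only shows how the sampling, fusion, and joint objective fit together.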
