Abstract

Action recognition is one of the most active areas in the computer vision community. Many previous works use a two-stream CNN model to capture both spatial and temporal cues for the prediction task. However, the two streams are trained separately and combined only by late fusion, a strategy that overlooks the interaction between spatial and temporal features. In this paper, we propose new two-stream CNN architectures that learn the relation between the two kinds of features and can be trained end-to-end with the standard back-propagation algorithm. We also introduce a Fisher loss that makes the learned features more discriminative. Experiments show that the Fisher loss yields higher accuracy than using only the softmax loss.
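The abstract does not give the exact formulation of the Fisher loss; the sketch below assumes the common Fisher-criterion style regularizer, which penalizes within-class scatter of the features while rewarding between-class scatter. The function name `fisher_loss` and the NumPy formulation are illustrative, not the paper's implementation.

```python
import numpy as np

def fisher_loss(features, labels):
    """Fisher-style regularizer (sketch, not the paper's exact loss):
    within-class scatter minus between-class scatter. Lower values mean
    features of each class are compact and class means are well separated.
    """
    mu = features.mean(axis=0)          # global mean of all features
    s_within, s_between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]      # features belonging to class c
        mc = fc.mean(axis=0)            # class mean
        s_within += ((fc - mc) ** 2).sum()
        s_between += len(fc) * ((mc - mu) ** 2).sum()
    return s_within - s_between
```

In training, a term like this would typically be added to the softmax cross-entropy with a small weight, so the network is pushed toward discriminative features without destabilizing classification.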
