Human action recognition is an important branch of computer vision. Recognition from skeletal data is a challenging task because joints carry complex spatiotemporal information. In this work, we propose an action recognition method consisting of three parts: a view-independent representation, a combination with cumulative Euclidean distance, and a combined model. First, each action sequence is transformed into a view-independent representation. Second, these representations are combined with cumulative Euclidean distances, so the joints more closely associated with the action are emphasised. Finally, a combined model extracts features from these representations and classifies actions; it consists of two parts, a regular three-layer bidirectional LSTM (BLSTM) network and a temporal attention module. Experimental results on two multi-view benchmark datasets, Northwestern-UCLA and NTU RGB+D, demonstrate the effectiveness of our complete method. Despite its simple architecture and the use of only one type of action feature, it still significantly improves recognition performance and exhibits strong robustness.
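To make the cumulative-Euclidean-distance step concrete, the following is a minimal sketch of one plausible formulation: per-joint motion is accumulated over the sequence and used to weight joints, so high-motion joints are emphasised. The function name, the normalisation, and the multiplicative weighting rule are assumptions for illustration; the abstract does not specify the paper's exact combination scheme.

```python
import numpy as np

def cumulative_distance_weights(seq):
    """Per-joint cumulative Euclidean distance over a skeleton sequence.

    seq: array of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    Returns weights of shape (J,), normalised so that joints that move
    more over the sequence receive larger weights.
    (Illustrative assumption; not the paper's exact formulation.)
    """
    diffs = np.diff(seq, axis=0)            # (T-1, J, 3) frame-to-frame displacement
    dists = np.linalg.norm(diffs, axis=-1)  # (T-1, J) per-step Euclidean distance
    cum = dists.sum(axis=0)                 # (J,) cumulative distance per joint
    return cum / (cum.sum() + 1e-8)         # normalise to a weight distribution

# Usage example with placeholder data.
T, J = 50, 25                               # e.g. NTU RGB+D skeletons have 25 joints
seq = np.random.rand(T, J, 3).astype(np.float32)
w = cumulative_distance_weights(seq)        # (J,)
weighted_seq = seq * w[None, :, None]       # emphasise high-motion joints
```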