Abstract

A motion energy image (MEI) is a spatial template that collapses the motion in a video segment into a single image in which pixels that move more often appear brighter. The forward single-step history image (fSHI) is a spatiotemporal template that captures both the presence and the direction of motion. Every video can be described using these templates. The recent success of deep learning architectures for human activity recognition motivates exploring their combination with these templates. Hence, three new methods are introduced that convert human activity recognition in video into a problem of image-template classification. In method 1, each video is split into N groups of consecutive frames, an MEI is computed for each group, and transfer learning with fine-tuning is used to classify the resulting templates. Method 2 follows the same procedure but classifies fSHIs, i.e., spatiotemporal templates. Method 3 fuses the two streams of templates. Method 3 outperforms the other two and is referred to as the proposed method; it achieves recognition accuracies of 92.60% and 93.40% on the UCF Sport and UCF-11 action datasets, respectively. A comparison with state-of-the-art approaches shows that the proposed method performs best.
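The per-group MEI computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes grayscale frames with values in [0, 1], accumulates thresholded frame-to-frame differences so that frequently moving pixels end up brighter, and the function names, the `threshold` parameter, and the normalization are all choices made here for clarity.

```python
import numpy as np

def motion_energy_image(frames, threshold=0.1):
    """Collapse a group of consecutive frames into one MEI.

    frames: array of shape (T, H, W), grayscale values in [0, 1].
    Pixels that move in more frames of the group end up brighter.
    (Illustrative sketch; the paper's exact MEI formula may differ.)
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(np.diff(frames, axis=0))          # (T-1, H, W) frame differences
    motion = (diffs > threshold).astype(np.float64)  # binary motion masks
    mei = motion.sum(axis=0)                         # count how often each pixel moved
    return mei / max(len(frames) - 1, 1)             # normalize brightness to [0, 1]

def video_to_meis(video, n_groups, threshold=0.1):
    """Split a video into n_groups of consecutive frames, one MEI per group."""
    groups = np.array_split(np.asarray(video), n_groups, axis=0)
    return [motion_energy_image(g, threshold) for g in groups]
```

Each of the N resulting templates would then be fed to a fine-tuned pretrained CNN as an ordinary image classification input.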
