Abstract

This paper presents a novel framework for human action recognition based on salient object detection and a new combination of local and global descriptors. We first detect salient objects in video frames and extract features only for those objects. A simple strategy then identifies and processes only the video frames that contain salient objects. Processing salient objects instead of all frames not only makes the algorithm more efficient, but, more importantly, also suppresses interference from background pixels. We combine this saliency-guided approach with two complementary descriptors: 3D-SIFT as the local descriptor and histograms of oriented optical flow (HOOF) as the global one. The resulting saliency-guided 3D-SIFT–HOOF (SGSH) feature is used along with a multi-class support vector machine (SVM) classifier for human action recognition. Experiments conducted on the standard KTH and UCF-Sports action benchmarks show that our new method outperforms competing state-of-the-art spatiotemporal feature-based human action recognition methods.
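As a rough illustration only, the following Python sketch shows how such a pipeline could be wired together. Here detect_salient_mask and extract_3dsift are hypothetical placeholders (the paper's actual saliency detector and local descriptor are not reproduced here), hoof is sketched under the Introduction below, and mean-pooling stands in for whatever aggregation the paper uses:

    import numpy as np
    from sklearn.svm import SVC

    def sgsh_descriptor(frames):
        """Pool local (3D-SIFT) and global (HOOF) descriptors over the
        salient frames of one clip into a single SGSH feature vector."""
        local_feats, global_feats = [], []
        for prev, cur in zip(frames, frames[1:]):
            mask = detect_salient_mask(cur)        # hypothetical saliency detector
            if not mask.any():                     # skip frames with no salient object
                continue
            local_feats.append(extract_3dsift(cur, mask))  # hypothetical local descriptor
            global_feats.append(hoof(prev, cur))           # global motion descriptor
        # Mean-pool per-frame descriptors into one fixed-length video feature;
        # the paper's actual aggregation (e.g., a codebook) may differ.
        return np.concatenate([np.mean(local_feats, axis=0),
                               np.mean(global_feats, axis=0)])

    # Multi-class SVM over SGSH features (scikit-learn's SVC is one-vs-one):
    # clf = SVC(kernel="rbf")
    # clf.fit(np.stack([sgsh_descriptor(v) for v in train_videos]), train_labels)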

Highlights

  • Action recognition is a fundamental task and step for many problems in computer vision, such as automatic visual surveillance, video retrieval, and human–computer interaction

  • Experiments performed on the standard KTH and UCF-Sports datasets show that the proposed method outperforms state-of-the-art features for action recognition

  • We have proposed a novel video feature extraction method based on saliency detection with a new combination of local and global descriptors



Introduction

Action recognition remains a challenging research area for several reasons. Extracting discriminative and informative features from video frames is difficult, and designing new methods that combine different types of features has become an important issue in action recognition. Researchers have proposed methods based on local representations [3,4,5], which describe the characteristics of local regions; global representations [6,7], which describe whole-frame characteristics; or a combination of the two [8], to improve accuracy and benefit from both representations. Local descriptors represent a video as features extracted from a collection of patches, ideally invariant to environmental clutter, appearance change, and occlusion, and possibly to rotation and scale change as well. Combining local and global features can exploit the advantages of each and provides a trade-off between efficiency and effectiveness.
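To make the global descriptor concrete, below is a minimal HOOF sketch in Python using OpenCV's Farneback dense optical flow; the bin count and flow parameters are illustrative assumptions, not the paper's settings:

    import cv2
    import numpy as np

    def hoof(prev_gray, next_gray, n_bins=30):
        """Magnitude-weighted orientation histogram of dense optical flow
        between two consecutive grayscale frames."""
        # Farneback flow; positional args are (pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        dx, dy = flow[..., 0], flow[..., 1]
        mag = np.hypot(dx, dy)      # per-pixel motion magnitude
        ang = np.arctan2(dy, dx)    # orientation in [-pi, pi]
        hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi),
                               weights=mag)
        total = hist.sum()
        # L1-normalize so the histogram is invariant to the amount of motion.
        return hist / total if total > 0 else hist

Because the histogram is weighted by flow magnitude and normalized, it summarizes the dominant motion directions of the whole frame, which is what makes it a global descriptor complementary to patch-level 3D-SIFT.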
