As a local invariant feature of videos, the spatiotemporal interest point (STIP) has been widely used in computer vision and pattern recognition. However, existing STIP detectors are generally extensions of detection algorithms designed for local invariant features of two-dimensional images, and therefore do not explicitly exploit the motion information inherent in the temporal domain of videos, which weakens their performance in a video context. To remedy this, we aim to develop an STIP detector that uniformly captures appearance and motion information in video, yielding a substantial performance improvement. Specifically, within the framework of geometric algebra, we first develop a spatiotemporal unified model of appearance and motion-variation information (UMAMV), and then propose a UMAMV-based scale space of the spatiotemporal domain to jointly analyze appearance and motion information in a video. Based on this model, we propose an STIP feature, UMAMV-SIFT, that embraces both the appearance and motion-variation information of videos. Three datasets of different sizes are used to evaluate the proposed model and the STIP detector. Experimental results show that UMAMV-SIFT achieves state-of-the-art performance and is particularly effective when the dataset is small.
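For orientation only, the sketch below illustrates the conventional SIFT-style baseline that such detectors extend: a difference-of-Gaussians (DoG) extrema search over a video volume with separable spatial and temporal smoothing. It is a minimal, hypothetical example, not the paper's UMAMV-based scale space or geometric-algebra formulation; the function name `spatiotemporal_dog_extrema` and parameters such as `tau_scale` are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def spatiotemporal_dog_extrema(video, sigmas=(1.0, 1.6, 2.56, 4.1),
                               tau_scale=0.5, thresh=0.02):
    """Detect candidate spatiotemporal interest points as local extrema
    of a difference-of-Gaussians stack over a video volume.

    video: float array of shape (T, H, W).
    sigmas: spatial smoothing scales; the temporal sigma is tau_scale * sigma.
    (Illustrative baseline only; the UMAMV scale space differs.)
    """
    # Gaussian scale space with separate temporal and spatial sigmas.
    stack = [gaussian_filter(video, sigma=(tau_scale * s, s, s)) for s in sigmas]
    # Differences of adjacent scales approximate the scale-normalized Laplacian.
    dog = np.stack([stack[i + 1] - stack[i] for i in range(len(stack) - 1)])
    # A voxel is a candidate if it is an extremum over its 3x3x3
    # spatiotemporal neighborhood and the adjacent DoG scales.
    maxima = dog == maximum_filter(dog, size=(3, 3, 3, 3))
    minima = dog == minimum_filter(dog, size=(3, 3, 3, 3))
    strong = np.abs(dog) > thresh  # suppress weak responses
    scale_idx, t, y, x = np.nonzero((maxima | minima) & strong)
    return list(zip(t, y, x, scale_idx))

# Usage: a random volume stands in for real video frames.
rng = np.random.default_rng(0)
video = gaussian_filter(rng.random((16, 64, 64)).astype(np.float32), sigma=1.0)
print(f"{len(spatiotemporal_dog_extrema(video))} candidate STIPs")
```

Because the temporal axis is smoothed with its own sigma, such a baseline treats motion only implicitly through intensity change over time; the abstract's stated contribution is to model appearance and motion variation jointly instead.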