As a local invariant feature of videos, the spatiotemporal interest point (STIP) has been widely used in computer vision and pattern recognition. However, existing STIP detectors are generally extensions of detection algorithms designed for local invariant features of two-dimensional images, and therefore do not explicitly exploit the motion information inherent in the temporal domain of videos, which weakens their performance in a video context. To remedy this, we aim to develop an STIP detector that uniformly captures appearance and motion information in video, yielding a substantial performance improvement. Specifically, within the framework of geometric algebra, we first develop a spatiotemporal unified model of appearance and motion-variation information (UMAMV), and then propose a UMAMV-based scale space of the spatiotemporal domain to jointly analyze appearance and motion information in a video. Based on this model, we propose an STIP feature, UMAMV-SIFT, that embraces both the appearance and motion-variation information of videos. Three datasets of different sizes are used to evaluate the proposed model and the STIP detector. Experimental results show that UMAMV-SIFT achieves state-of-the-art performance and is particularly effective when the dataset is small.
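For orientation only, the sketch below illustrates the conventional SIFT-style baseline that such detectors extend: a difference-of-Gaussians (DoG) extrema search over a video volume with separable spatial and temporal smoothing. It is a minimal, hypothetical example, not the paper's UMAMV-based scale space or geometric-algebra formulation; the function name `spatiotemporal_dog_extrema` and parameters such as `tau_scale` are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def spatiotemporal_dog_extrema(video, sigmas=(1.0, 1.6, 2.56, 4.1),
                               tau_scale=0.5, thresh=0.02):
    """Detect candidate spatiotemporal interest points as local extrema
    of a difference-of-Gaussians stack over a video volume.

    video: float array of shape (T, H, W).
    sigmas: spatial smoothing scales; the temporal sigma is tau_scale * sigma.
    (Illustrative baseline only; the UMAMV scale space differs.)
    """
    # Gaussian scale space with separate temporal and spatial sigmas.
    stack = [gaussian_filter(video, sigma=(tau_scale * s, s, s)) for s in sigmas]
    # Differences of adjacent scales approximate the scale-normalized Laplacian.
    dog = np.stack([stack[i + 1] - stack[i] for i in range(len(stack) - 1)])
    # A voxel is a candidate if it is an extremum over its 3x3x3
    # spatiotemporal neighborhood and the adjacent DoG scales.
    maxima = dog == maximum_filter(dog, size=(3, 3, 3, 3))
    minima = dog == minimum_filter(dog, size=(3, 3, 3, 3))
    strong = np.abs(dog) > thresh  # suppress weak responses
    scale_idx, t, y, x = np.nonzero((maxima | minima) & strong)
    return list(zip(t, y, x, scale_idx))

# Usage: a random volume stands in for real video frames.
rng = np.random.default_rng(0)
video = gaussian_filter(rng.random((16, 64, 64)).astype(np.float32), sigma=1.0)
print(f"{len(spatiotemporal_dog_extrema(video))} candidate STIPs")
```

Because the temporal axis is smoothed with its own sigma, such a baseline treats motion only implicitly through intensity change over time; the abstract's stated contribution is to model appearance and motion variation jointly instead.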