Video retrieval of near-duplicates using κ-nearest neighbor retrieval of spatio-temporal descriptors

Daniel Dementhon, David Doermann

doi:10.1007/s11042-006-0029-z

Abstract

This paper describes a novel methodology for implementing video search functions such as retrieval of near-duplicate videos and recognition of actions in surveillance video. Videos are divided into half-second clips whose stacked frames produce 3D space-time volumes of pixels. Pixel regions with consistent color and motion properties are extracted from these 3D volumes by a threshold-free hierarchical space-time segmentation technique. Each region is then described by a high-dimensional point whose components represent the position, orientation and, when possible, color of the region. In the indexing phase for a video database, these points are assigned labels that specify their video clip of origin. All the labeled points for all the clips are stored into a single binary tree for efficient κ-nearest neighbor retrieval. The retrieval phase uses video segments as queries. Half-second clips of these queries are again segmented by space-time segmentation to produce sets of points, and for each point the labels of its nearest neighbors are retrieved. The labels that receive the largest numbers of votes correspond to the database clips that are the most similar to the query video segment. We illustrate this approach for video indexing and retrieval and for action recognition. First, we describe retrieval experiments for dynamic logos, and for video queries that differ from the indexed broadcasts by the addition of large overlays. Then we describe experiments in which office actions (such as pulling and closing drawers, taking and storing items, picking up and putting down a phone) are recognized.
Color information is ignored to ensure that action recognition is independent of people's appearance. One distinct advantage of this approach for action recognition is that it requires no detection or recognition of body parts.
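
The indexing and voting scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptors are made-up 3D points rather than real space-time region descriptors, the names `build_index` and `vote_for_clips` are hypothetical, and a brute-force nearest-neighbor search stands in for the binary tree mentioned above.

```python
import heapq
from collections import Counter
from math import dist


def build_index(clips):
    """Flatten {clip_label: [descriptor, ...]} into a list of (descriptor, label) pairs."""
    return [(tuple(point), label) for label, points in clips.items() for point in points]


def vote_for_clips(index, query_points, k=3):
    """Tally the clip labels of the k nearest indexed descriptors of each query point.

    Brute-force search is used here for brevity (an assumption of this sketch);
    a tree-based index would return the same neighbors more efficiently.
    """
    votes = Counter()
    for q in query_points:
        nearest = heapq.nsmallest(k, index, key=lambda entry: dist(entry[0], q))
        votes.update(label for _, label in nearest)
    # Clips sorted by decreasing number of votes.
    return votes.most_common()


# Toy database: two clips, each described by a few illustrative descriptors.
database = {
    "clip_a": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.2, 0.0)],
    "clip_b": [(5.0, 5.0, 5.0), (5.1, 4.9, 5.0), (4.8, 5.0, 5.2)],
}
index = build_index(database)

# Query descriptors that lie close to those of clip_a.
query = [(0.05, 0.0, 0.0), (0.0, 0.1, 0.1)]
ranking = vote_for_clips(index, query, k=2)
```

In this toy case every neighbor of every query point belongs to `clip_a`, so that label collects all the votes and ranks first. With real data, votes would be spread across several clips and the top-ranked labels would be the candidate matches.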

Full Text