Abstract

Digital images and motion video have proliferated in recent years, ranging from ever-growing personal photo and video collections to professional news and documentary archives. When searching through such archives, image indexing based on low-level features such as colour and texture, or on manually entered text annotations, often fails to meet the user’s information need: there is a semantic gap produced by “the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation” (Smeulders, Worring, Santini, Gupta and Jain 2000). The image/video analysis community has long struggled to bridge this semantic gap between low-level feature analysis (colour histograms, texture, shape) and semantic content description of video. Early video retrieval systems (Lew 2002; Smith, Lin, Naphade, Natsev and Tseng 2002) usually modelled video clips with a set of low-level detectable features generated from different modalities. Such low-level video features can be extracted accurately and automatically: examples include histograms in the HSV, RGB, and YUV colour spaces, Gabor texture or wavelets, and structure captured through edge direction histograms and edge maps. However, because these features cannot express the semantic meaning of the video content, such systems had very limited success with semantic queries. Several studies have confirmed the difficulty of addressing information needs with low-level features alone (Markkula and Sormunen 2000; Rodden, Basalaj, Sinclair and Wood 2001). One approach to overcoming this “semantic gap” is to utilise a set of intermediate textual descriptors that can be reliably applied to visual content concepts (e.g. outdoors, faces, animals).
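To make the notion of a low-level feature concrete, the following is a minimal sketch (not from the cited systems) of one of the simplest features mentioned above: a quantised colour histogram computed with NumPy. The function name, bin count, and the synthetic test image are illustrative assumptions; real systems would typically operate in HSV or YUV space and combine several such features.

```python
import numpy as np

def rgb_histogram(image, bins=8):
    """Quantised RGB colour histogram, a typical low-level image feature.

    image: H x W x 3 uint8 array. Returns a flat vector of bins**3 counts,
    normalised to sum to 1 so that images of different sizes are comparable.
    (Illustrative sketch; bins=8 is an arbitrary choice.)
    """
    # Map each 0-255 channel value to one of `bins` quantisation levels.
    quantised = (image.astype(np.int64) * bins) // 256
    # Combine the three per-channel levels into a single joint bin index.
    idx = (quantised[..., 0] * bins + quantised[..., 1]) * bins + quantised[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

# Example: a synthetic 4x4 image that is half red, half blue.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2, 0] = 255   # left half pure red
img[:, 2:, 2] = 255   # right half pure blue
h = rgb_histogram(img, bins=8)
```

The resulting vector can be compared between images with any standard distance measure, which is exactly the retrieval mode whose semantic limitations the studies above document: two images with similar colour distributions need not depict similar content.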
Many researchers have been developing automatic semantic concept classifiers, such as those related to people (face, anchor, etc.), acoustics (speech, music, significant pause), objects (image blobs, buildings, graphics),
