The development of appropriate tools and solutions to support effective access to video content is one of the main challenges for video digital libraries. Different techniques for manual and automatic annotation and retrieval have been proposed in recent years. It is common practice to use linguistic ontologies for video annotation and retrieval: video elements are classified by establishing relationships between video contents and linguistic terms that identify domain concepts at different abstraction levels. However, although linguistic terms are appropriate to distinguish event and object categories, they are inadequate for describing specific or complex patterns of events or video entities. In these cases, pattern specifications can be better expressed using visual prototypes, either images or video clips, that capture the essence of the event or entity. High-level concepts, expressed through linguistic terms, and pattern specifications, represented by visual prototypes, can both be organized into new extended ontologies in which images or video clips are added as specifications of linguistic terms. This paper presents algorithms and techniques that employ such enriched ontologies for video annotation and retrieval, and discusses a solution for their implementation in the soccer video domain. An unsupervised clustering method is proposed to create multimedia enriched ontologies by defining visual prototypes that represent specific patterns of highlights and adding them as visual concepts to the ontology. An algorithm that uses multimedia enriched ontologies to perform automatic soccer video annotation is proposed, and results for typical highlights are presented. Annotation is performed by associating occurrences of events, or entities, with higher-level concepts: a dynamic programming approach checks their similarity to the visual concepts that are hierarchically linked to higher-level semantics. Finally, reasoning on the ontology is shown to support complex queries that combine visual prototypes of actions, their temporal evolution, and their relations.
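The abstract outlines two core steps, unsupervised clustering of clip features into visual prototypes and dynamic-programming matching of new clips against those prototypes, without fixing implementation details. The following is a minimal sketch, assuming frame-level feature sequences per clip, a k-medoids-style clustering over pairwise alignment costs, and a DTW-like dynamic programming similarity; all function names, parameter values, and data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): visual prototypes for a
# highlight type are obtained by clustering per-clip feature sequences, and a new
# clip is annotated with the concept whose prototype it matches best.
import numpy as np

def dtw_distance(a, b):
    """Dynamic programming alignment cost between two frame-feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)          # length-normalised alignment cost

def build_prototypes(clips, n_clusters=3):
    """Unsupervised grouping of clips; the medoid of each cluster becomes a visual prototype."""
    k = min(n_clusters, len(clips))
    dist = np.array([[dtw_distance(a, b) for b in clips] for a in clips])
    medoids = list(range(k))             # naive initialisation
    for _ in range(20):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = []
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                new_medoids.append(medoids[c])
                continue
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[np.argmin(within)]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [clips[m] for m in medoids]

def annotate(clip, ontology):
    """Assign the linguistic concept whose visual prototype is most similar to the clip."""
    best_concept, best_score = None, np.inf
    for concept, prototypes in ontology.items():
        for proto in prototypes:
            score = dtw_distance(clip, proto)
            if score < best_score:
                best_concept, best_score = concept, score
    return best_concept, best_score

# Toy usage with synthetic feature sequences (frames x feature-dim) for two highlight types.
rng = np.random.default_rng(0)
shots = [rng.normal(0.0, 1.0, (20, 8)) for _ in range(6)]
fouls = [rng.normal(3.0, 1.0, (25, 8)) for _ in range(6)]
ontology = {
    "shot_on_goal": build_prototypes(shots, n_clusters=2),
    "foul": build_prototypes(fouls, n_clusters=2),
}
print(annotate(rng.normal(3.0, 1.0, (22, 8)), ontology))
```

In this reading, the visual prototypes act as intermediate nodes between raw clips and linguistic concepts: annotation reduces to a nearest-prototype decision, and the prototype's position in the ontology supplies the higher-level semantics.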