Abstract

Huge amount of videos on the Internet have rare textual information, which makes video retrieval challenging given a text query. Previous work explored semantic concepts for content analysis to assist retrieval. However, the human- defined concepts might fail to cover the data and there is a potential gap between these concepts and the semantics ex- pected from user's query. Also, building a corpus is expensive and time-consuming. To address these issues, we propose a semi-automatic framework to discover the semantic concepts. We limit ourselves in audio modality here. In the paper, we also discuss how to select meaningful vocabulary from the discovered hierarchical sub-categories and provide an ap- proach to detect all the concepts without further annotation. We evaluate the method on NIST 2011 multimedia event detection (MED) dataset. There is a continuing growth of video collections available for searching over the Internet. Given a query, search engine retrieve relevant videos by analyzing their captions or textual descriptions. This initial method faces big problem. A large proportion of the videos are lack of detailed textual informa- tion. Even for those with rich textual descriptions, the gap be- tween the content in the video and the given textual informa- tion is inevitable. To address the problem, advanced technolo- gies in multimedia content analysis have become very popular recently. Previous work has explored detecting semantic concepts in multiple modalities to capture the embedded semantics in the multimedia stream (1) (2) (3). In this paper, we limit ourselves to discuss the audio semantic concepts. The au- dio semantic concepts are often defined by human based on their understanding of the specific application and their ob- servations of limited data. After annotating a training corpus of these defined concepts, people apply multiple supervised methods for the semantic concept detection. Previous stud- ies have adopted methods in speech recognition and speaker identification (2) (3). These approaches have been shown to be effective on certain dataset for certain application. However, there are potential problems of these approaches. Firstly, defining a proper vocabulary of the semantic concepts is time-consuming. Human experts have to observe a large amount of data and summarize the patterns into a number of semantic concepts based on their understanding. In this situation, it is very likely that these semantic concepts fail to cover the data and the vocabulary might be ineffective to retrieve the information needed for the application. These problems become more serious when new videos are added to the collection continuously in practical application. Sec- ondly, the generalization of the human defined vocabulary is another problem. The domain-specific semantic concepts be- come useless in new domains. And other applications might

Highlights

  • There is a continuing growth of video collections available for searching over the Internet

  • Human experts have to observe a large amount of data and summarize the patterns into a number of semantic concepts based on their understanding

  • It is very likely that these semantic concepts fail to cover the data and the vocabulary might be ineffective to retrieve the information needed for the application

Read more

Summary

INTRODUCTION

There is a continuing growth of video collections available for searching over the Internet. Human experts have to observe a large amount of data and summarize the patterns into a number of semantic concepts based on their understanding In this situation, it is very likely that these semantic concepts fail to cover the data and the vocabulary might be ineffective to retrieve the information needed for the application. It is very likely that these semantic concepts fail to cover the data and the vocabulary might be ineffective to retrieve the information needed for the application These problems become more serious when new videos are added to the collection continuously in practical application. It is still impossible to retrieve content-related audios given a text query Another problem of the pure unsupervised method is the gap between semantic similarity and acoustic similarity.

Overview
Learning Acoustic Descriptors
Hierarchical Sub-categories Discovery
Semantic Concept Vocabulary Selection
The dataset and seed annotations
Experiment setup
Result and Analysis
CONCLUSION
Vocabulary Selection for MED
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call