Abstract
A huge number of videos on the Internet carry little textual information, which makes retrieving them from a text query challenging. Previous work explored semantic concepts for content analysis to assist retrieval. However, human-defined concepts might fail to cover the data, and there is a potential gap between these concepts and the semantics expected from the user's query. Also, building a corpus is expensive and time-consuming. To address these issues, we propose a semi-automatic framework to discover the semantic concepts. We limit ourselves to the audio modality here. In the paper, we also discuss how to select a meaningful vocabulary from the discovered hierarchical sub-categories and provide an approach to detect all the concepts without further annotation. We evaluate the method on the NIST 2011 multimedia event detection (MED) dataset.

There is a continuing growth of video collections available for searching over the Internet. Given a query, a search engine retrieves relevant videos by analyzing their captions or textual descriptions. This basic method faces a major problem: a large proportion of videos lack detailed textual information. Even for those with rich textual descriptions, a gap between the content of the video and the given textual information is inevitable. To address the problem, advanced technologies in multimedia content analysis have become very popular recently. Previous work has explored detecting semantic concepts in multiple modalities to capture the semantics embedded in the multimedia stream (1) (2) (3). In this paper, we limit our discussion to audio semantic concepts. Audio semantic concepts are often defined by humans based on their understanding of the specific application and their observations of limited data. After annotating a training corpus for these defined concepts, people apply various supervised methods for semantic concept detection.
Previous studies have adopted methods from speech recognition and speaker identification (2) (3). These approaches have been shown to be effective on certain datasets for certain applications. However, they have potential problems. Firstly, defining a proper vocabulary of semantic concepts is time-consuming. Human experts have to observe a large amount of data and summarize the patterns into a number of semantic concepts based on their understanding. In this situation, it is very likely that these semantic concepts fail to cover the data, and the vocabulary might be ineffective for retrieving the information needed for the application. These problems become more serious when new videos are continuously added to the collection in practical applications. Secondly, the generalization of the human-defined vocabulary is another problem. Domain-specific semantic concepts become useless in new domains. And other applications might
Highlights
There is a continuing growth of video collections available for searching over the Internet
Human experts have to observe a large amount of data and summarize the patterns into a number of semantic concepts based on their understanding
It is very likely that these semantic concepts fail to cover the data and the vocabulary might be ineffective to retrieve the information needed for the application
Summary
There is a continuing growth of video collections available for searching over the Internet. Human experts have to observe a large amount of data and summarize the patterns into a number of semantic concepts based on their understanding. In this situation, it is very likely that these semantic concepts fail to cover the data, and the vocabulary might be ineffective for retrieving the information needed for the application. These problems become more serious when new videos are continuously added to the collection in practical applications. It is still impossible to retrieve content-related audio given a text query. Another problem with the purely unsupervised method is the gap between semantic similarity and acoustic similarity.