Abstract

In the last decade, video content analysis has attracted increasing research interest in the fields of multimedia and computer vision. With the explosive growth of videos on the web and other multimedia sources, many applications need effective models that can automatically analyse video content. Among the different video content analysis tasks, event detection, recognition, recounting and retrieval in unconstrained settings are the most challenging, because events often involve diverse spatial-temporal semantics such as objects, human actions and scenes. To better analyse the events contained in videos, researchers have tried either to design powerful visual features or to build effective models. However, several technical issues have not yet been well addressed, for example: how to reduce the computational complexity of training a hash model for video event retrieval as the number of training videos grows; how to effectively integrate spatial and temporal information in videos for event detection; and how to exploit contextual information to enhance model training. This thesis focuses on building effective and efficient models for video event detection, recognition and retrieval, and it contains the following four parts.

The first part designs a generic model, the Max-margin adaptive model (MMA), for video pattern recognition. The MMA model combines the advantages of semi-supervised learning and transfer learning, so it can use both labelled and unlabelled videos for model training. It accounts for the data distribution consistency between labelled videos and unlabelled auxiliary videos from a statistical perspective by learning an optimal mapping function, and it broadens the geometric margin between positive- and negative-labelled videos to improve the robustness of the model.

The second part builds a deep spatial-temporal model for multimedia event detection (MED). In our setting, each video follows a multiple-instance assumption, in which its visual segments carry both the spatial and the temporal properties of events. To exploit these properties, we implement the MED system with a two-step deep training procedure, unsupervised recurrent video reconstruction followed by supervised fine-tuning, which improves the generality of the model and boosts event detection accuracy.

In the third part, we propose a context-based framework for web video event recognition. Unlike content-based video recognition, the proposed framework considers the properties of both the video content and the associated web documents, since web videos often describe events at a coarse granularity and carry very limited textual information. We first construct an event knowledge base by deeply mining the semantic information in web documents, and then propose a Two-view adaptive regression model (TVAR) that explores the intrinsic correlation between the visual and textual cues of web videos to learn reliable classifiers.

In the fourth part, we propose a hashing model, Visual State Binary Embedding (VSBE), for scalable video event retrieval. The VSBE model preserves the essential semantic information of the videos in binary codes to ensure effective retrieval performance. Compared with other video binary embedding models, one advantage of our method is that it needs only a limited number of key frames from the training videos to train the hash model, so the computational complexity of the training phase is much lower. At the same time, we apply pairwise constraints generated from the visual states to capture the local properties of the events at the semantic level, so retrieval accuracy is preserved.
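To make the two-step training procedure of the second part more concrete, the following is a minimal PyTorch sketch: a recurrent autoencoder first reconstructs sequences of segment-level features without labels, and its encoder is then fine-tuned together with a segment scorer whose video-level prediction is obtained by max-pooling over segments, one common reading of the multiple-instance assumption. The module names, feature dimensions, GRU architecture and max-pooling aggregation are illustrative assumptions, not the exact architecture used in the thesis.

import torch
import torch.nn as nn

class SegmentSeqAutoencoder(nn.Module):
    """GRU encoder-decoder that reconstructs a sequence of segment features."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, feat_dim)

    def forward(self, segments):                      # segments: (batch, time, feat_dim)
        encoded, state = self.encoder(segments)       # temporal codes for each segment
        decoded, _ = self.decoder(encoded, state)
        return self.readout(decoded), encoded         # reconstruction + segment codes

class EventScorer(nn.Module):
    """Scores each segment; max-pooling over segments gives the video-level score,
    reflecting the multiple-instance assumption (one positive segment suffices)."""
    def __init__(self, hidden_dim=256, num_events=20):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_events)

    def forward(self, segment_codes):                 # segment_codes: (batch, time, hidden_dim)
        return self.classifier(segment_codes).max(dim=1).values

# Hypothetical toy data: 8 unlabelled and 8 labelled videos, 12 segments each, 512-D features.
unlabelled = torch.randn(8, 12, 512)
labelled, labels = torch.randn(8, 12, 512), torch.randint(0, 20, (8,))

# Step 1: unsupervised recurrent video reconstruction.
autoenc = SegmentSeqAutoencoder()
opt1 = torch.optim.Adam(autoenc.parameters(), lr=1e-3)
for _ in range(50):
    recon, _ = autoenc(unlabelled)
    loss = nn.functional.mse_loss(recon, unlabelled)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Step 2: supervised fine-tuning of the encoder together with the event scorer.
scorer = EventScorer()
opt2 = torch.optim.Adam(list(autoenc.encoder.parameters()) + list(scorer.parameters()), lr=1e-4)
for _ in range(50):
    _, codes = autoenc(labelled)
    loss = nn.functional.cross_entropy(scorer(codes), labels)
    opt2.zero_grad(); loss.backward(); opt2.step()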
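The pairwise-constrained binary embedding of the fourth part can likewise be sketched in a few lines. The sketch below assumes a linear projection of key-frame features, a tanh relaxation of the sign function during training, and a hinge loss on must-link/cannot-link pairs; how the visual states and their constraints are actually constructed is not specified in the abstract, so those parts are placeholders rather than the VSBE formulation itself.

import torch
import torch.nn as nn

feat_dim, code_bits = 512, 48
projection = nn.Linear(feat_dim, code_bits)        # maps a key-frame feature to a short code
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)

# Hypothetical toy data: key-frame features plus pairwise constraints derived from
# visual states; (i, j, +1) marks a semantically similar pair, (i, j, -1) a dissimilar one.
frames = torch.randn(100, feat_dim)
constraints = [(0, 1, 1.0), (0, 2, -1.0), (3, 4, 1.0), (5, 6, -1.0)]

for _ in range(200):
    relaxed = torch.tanh(projection(frames))       # relaxed codes in (-1, 1) instead of sign()
    loss = torch.zeros(())
    for i, j, s in constraints:
        similarity = (relaxed[i] * relaxed[j]).sum() / code_bits   # normalised inner product
        loss = loss + torch.relu(0.5 - s * similarity)             # hinge on signed similarity
    optimizer.zero_grad(); loss.backward(); optimizer.step()

codes = torch.sign(projection(frames)).detach()    # final binary codes in {-1, +1}

Because each video is then represented by short binary codes over its key frames, retrieval reduces to Hamming-distance comparisons, and only the small key-frame set is touched during training, mirroring the complexity argument above.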
