Abstract

Thanks to rapid advances in mobile video-capture devices and network connections, more and more users record their daily lives on video. Video is becoming a common means of recording everything from marriage proposals to how to repair an appliance, and the number of user-generated videos that record complex events is growing explosively on the Web. Technologies that assist in automatically understanding video events are therefore in high demand for analysing and managing this ever-growing amount of video content.

The main challenges of video event understanding stem from the diversity and complexity of video content and from its temporal nature. To address these challenges, this thesis exploits semantic and temporal information for video event classification and retrieval, the two main tasks in video event understanding. The work is organized into four main chapters.

For video event classification, Chapter 3 addresses the diversity and complexity of video content by defining two types of latent concepts, i.e. static-visual concepts at the frame level and activity concepts at the segment level, to alleviate the influence of high intra-class variation. Furthermore, we propose a data-driven hierarchical structure of latent variables to discover these latent concepts, with temporal information utilized in the discovery process. In Chapter 4, Long Short-Term Memory (LSTM) is employed to capture the temporal information in videos. A novel temporal attention model is proposed that enables the framework to focus on the most relevant shots during classification. Moreover, weak semantic relevance is incorporated as fine-grained, shot-level guidance for the proposed temporal attention model to further improve classification performance. In contrast to Chapter 3, where the underlying semantic information is organized as latent concepts and the concept-discovery process is data-driven, in the framework of Chapter 4 semantic information is formalized as weak semantic relevance and employed as explicit supervision.

Recently, hashing has proven to be an efficient and effective way to facilitate large-scale video retrieval. Most existing hashing methods are based on static features, yet the intrinsic temporal patterns embedded in videos have also shown discriminative power for similarity search; how to leverage the strengths of both aspects, however, remains an open question. In Chapter 5, we propose to jointly model these two essential aspects of videos (i.e. temporal patterns and static features) with two encoders for unsupervised video event retrieval. For the joint modelling, three learning criteria for generating high-quality hash codes are imposed on the two encoders. To further explore how to use features from both aspects more effectively, Chapter 6 proposes a novel information-filtering mechanism, Adaptive Selection, which exploits the complementary advantages of the two aspects for supervised video event retrieval.
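
To make the shot-level attention idea of Chapter 4 concrete, the following is a minimal sketch rather than the thesis implementation: an LSTM runs over per-shot features, a learned layer scores each hidden state, and the states are pooled by their softmax weights to form a video representation. The feature dimension, hidden size, class count, and all names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionClassifier(nn.Module):
    """Soft temporal attention over shot-level LSTM states (illustrative sketch only)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.att = nn.Linear(hidden_dim, 1)            # scores each shot
        self.cls = nn.Linear(hidden_dim, num_classes)  # event classifier

    def forward(self, shots):                          # shots: (batch, num_shots, feat_dim)
        h, _ = self.lstm(shots)                        # (batch, num_shots, hidden_dim)
        alpha = torch.softmax(self.att(h).squeeze(-1), dim=1)  # attention weights over shots
        video_repr = (alpha.unsqueeze(-1) * h).sum(dim=1)      # attention-weighted pooling
        return self.cls(video_repr), alpha

# Hypothetical usage: 4 videos, 30 shots each, 2048-d shot features.
logits, alpha = TemporalAttentionClassifier()(torch.randn(4, 30, 2048))
```

In the thesis, weak semantic relevance supervises this attention at the shot level; one way to picture it is an auxiliary loss encouraging the weights alpha to agree with per-shot relevance scores, though the exact formulation is defined in Chapter 4, not here.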

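Similarly, a rough, hypothetical sketch of the two-encoder idea behind Chapter 5: an LSTM summarizes temporal patterns, a small MLP encodes mean-pooled static features, and their concatenation is projected to relaxed codes that are binarized by the sign function at retrieval time. The dimensions, layer choices, and the three learning criteria imposed on the encoders are not reproduced here; everything below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class TwoEncoderHasher(nn.Module):
    """Joint temporal/static encoding into binary hash codes (illustrative sketch only)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, code_bits=64):
        super().__init__()
        self.temporal_enc = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.static_enc = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.to_code = nn.Linear(2 * hidden_dim, code_bits)

    def forward(self, frames):                         # frames: (batch, num_frames, feat_dim)
        _, (h_n, _) = self.temporal_enc(frames)        # final hidden state summarizes temporal pattern
        temporal = h_n[-1]                             # (batch, hidden_dim)
        static = self.static_enc(frames.mean(dim=1))   # mean-pooled static appearance
        relaxed = torch.tanh(self.to_code(torch.cat([temporal, static], dim=1)))
        return torch.sign(relaxed), relaxed            # binary codes for retrieval, relaxed codes for training
```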