Unsupervised Mining of Statistical Temporal Structures in Video

Lexing Xie, Huifang Sun, Ajay Divakaran, Shih-Fu Chang

doi:10.1007/978-1-4757-6928-9_10

Abstract

In this chapter we present algorithms for unsupervised mining of structures in video using multi-scale statistical models. Video structures are repetitive segments in a video stream with consistent statistical characteristics. Such structures can often be interpreted in relation to distinctive semantics, particularly in structured domains like sports. While much work in the literature explores the link between the observations and the semantics using supervised learning, we propose unsupervised structure mining algorithms that aim to alleviate the burden of labelling and training, and to provide a scalable solution for generalizing video indexing techniques to heterogeneous content collections such as surveillance and consumer video. Existing unsupervised video structuring work primarily uses clustering techniques, while the rich statistical characteristics in the temporal dimension at different granularities remain unexplored. Automatically identifying structures from an unknown domain poses significant challenges when domain knowledge is not explicitly present to assist algorithm design, model selection, and feature selection. In this work we model multi-level statistical structures with hierarchical hidden Markov models based on a multi-level Markov dependency assumption. The parameters of the model are efficiently estimated using the EM algorithm. We have also developed a model structure learning algorithm that uses stochastic sampling techniques to find the optimal model structure, and a feature selection algorithm that automatically finds compact relevant feature sets using hybrid wrapper-filter methods. When tested on sports videos, the unsupervised learning scheme achieves very promising results: (1) The automatically selected feature set for soccer and baseball videos matches sets that are manually selected with domain knowledge. (2) The system automatically discovers high-level structures that match the semantic events in the video. (3) The system achieves better accuracy in detecting semantic events in unlabelled soccer videos than a competing supervised approach designed and trained with domain knowledge.

Full Text