Abstract

Video segmentation is the task of temporally dividing a video into semantic sections, which are typically based on a specific concept or theme defined by the user’s intention. However, previous studies of video segmentation have thus far not taken the user’s intention into consideration. In this paper, a two-stage user-guided video segmentation framework is presented, comprising dimension reduction and temporal clustering. In the dimension-reduction stage, coarse-granularity features are extracted by a deep convolutional neural network pre-trained on ImageNet. In the temporal-clustering stage, the user’s intention is utilized to segment videos in the time domain with a proposed operator that calculates the similarity distance between dimension-reduced frames. To provide more insight into the videos, a hierarchical clustering method that allows users to segment videos at different granularities is also proposed. Evaluation on the Open Video Scene Detection (OVSD) dataset shows that the proposed method achieves an average F-score of 0.72, even though coarse-grained feature extraction is adopted. The evaluation also demonstrates that the proposed method not only produces different segmentation results according to the user’s intention, but also produces hierarchical segmentation results from a low level of abstraction to a higher one.
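The first (dimension-reduction) stage of the pipeline described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: it assumes a torchvision ResNet-50 pre-trained on ImageNet as the coarse-granularity feature extractor, since the paper only states that a pre-trained deep convolutional neural network is used.

```python
import torch
from torchvision import models, transforms

# Hypothetical stage 1: coarse-granularity feature extraction.
# A ResNet-50 pre-trained on ImageNet is assumed; the paper only specifies
# "a deep convolutional neural network pre-trained on ImageNet".
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 2048-d embedding
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    """Map a list of PIL frames to low-dimensional per-frame feature vectors."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)          # shape: (num_frames, 2048)
```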

Highlights

  • Video segmentation is the task of temporally dividing a video into semantic sections [1], which are typically based on a specific concept or a theme usually defined by the user’s intention

  • This paper proposes a temporal clustering method to deal with user intention for video segmentation, which utilizes several segmentation reference points set by users to temporally cluster frames into semantic sections

  • The proposed method proceeds in three steps: (i) we introduce an operator for calculating the inter-frame similarity distance; (ii) we utilize the operator to regress the cluster radius from the user’s intention in the time domain; (iii) we propose a temporal clustering method based on the regressed radius (see the sketch after these highlights)
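The following is a minimal sketch of the temporal-clustering stage outlined above, assuming cosine distance as the inter-frame similarity operator and a simple greedy grouping rule. The helper names (`similarity_distance`, `regress_radius`, `temporal_cluster`) and the representation of the user's reference points as (start, end) frame ranges are assumptions for illustration, not the authors' definitions.

```python
import numpy as np

def similarity_distance(f_a, f_b):
    """Hypothetical inter-frame operator: cosine distance between feature vectors."""
    cos = np.dot(f_a, f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b) + 1e-8)
    return 1.0 - cos

def regress_radius(features, reference_points):
    """Estimate a cluster radius from the user's segmentation reference points.
    features: NumPy array of per-frame feature vectors (e.g. the extractor
    output above converted with .numpy()); reference_points: list of
    (start, end) frame ranges the user marked as one section (an assumption).
    Here the radius is simply the mean distance between consecutive frames
    inside the user-marked sections."""
    dists = [similarity_distance(features[i], features[i + 1])
             for start, end in reference_points
             for i in range(start, end - 1)]
    return float(np.mean(dists))

def temporal_cluster(features, radius):
    """Greedy temporal clustering: start a new section whenever the distance
    from the current frame to the running section centroid exceeds the
    regressed radius. Returns the frame indices where sections begin."""
    boundaries, centroid, count = [0], features[0].astype(float), 1
    for i in range(1, len(features)):
        if similarity_distance(centroid / count, features[i]) > radius:
            boundaries.append(i)          # new semantic section begins here
            centroid, count = features[i].astype(float), 1
        else:
            centroid, count = centroid + features[i], count + 1
    return boundaries
```

Given per-frame features and user-marked reference ranges, `regress_radius` calibrates the radius and `temporal_cluster` returns section boundaries; rerunning the clustering with progressively larger radii would give the coarser-to-finer hierarchical segmentation the paper describes, under these assumptions.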


Summary

Introduction

Video segmentation is the task of temporally dividing a video into semantic sections [1], which are typically based on a specific concept or theme usually defined by the user’s intention. Different segmentation granularities may exist, mainly referring to shots and scenes. A shot is a series of frames captured by the same camera in continuous time. A scene is a sequence of semantically related and temporally adjacent shots depicting a high-level concept or story. Video segmentation is fundamental to the process of summarizing, retrieving, understanding, and classifying the content of a video. Three basic research approaches have been adopted for video segmentation. The first is the rule-based method, which uses heuristic rules derived from the film industry to divide videos [2]–[7].
