Abstract

First-person video summarization has emerged as an important research problem for the computer vision and multimedia communities. In this paper, we show how different graph representations can be developed to summarize first-person (egocentric) videos accurately and in a computationally efficient manner. Each frame in a video is first represented as a weighted graph. A shot boundary detection method using graph-based mutual information is developed. We next construct a weighted graph for each shot, and a representative frame from each shot is selected using a graph centrality measure. A new way of characterizing egocentric video frames using a graph-based center-surround model is then presented. Here, each representative frame is modeled as the union of a center region (graph) and a surround region (graph). By exploiting spectral measures of dissimilarity between the center and surround graphs, optimal center and surround regions are determined. The optimal regions for all frames within a shot are kept the same as those of the representative frame. Center-surround differences in entropy and optical flow values, along with PHOG (Pyramidal HOG) features, are extracted from each frame. All frames in a video are finally represented by another weighted graph, termed a Video Similarity Graph (VSG). The frames are clustered by applying a Minimum Spanning Tree (MST) based approach with a new measure for inadmissible edges. The frames closest to the centroid of each cluster are selected to build the summary. Experimental evaluation on two benchmark datasets indicates the advantage of the proposed formulation.
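
To make the final clustering stage concrete, the following is a minimal, hypothetical sketch of MST-based grouping of per-frame descriptors followed by centroid-nearest key frame selection. It assumes each frame has already been reduced to a feature vector; the paper's actual descriptors (center-surround entropy and optical flow differences plus PHOG) and its specific measure for inadmissible edges are not reproduced, and a simple mean-edge-length threshold stands in for the latter.

```python
# Hypothetical sketch: MST-based clustering over a frame similarity graph,
# then picking the frame nearest each cluster centroid as a key frame.
# The inadmissible-edge test below (edge_factor * mean MST edge length)
# is an illustrative stand-in, not the measure proposed in the paper.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def summarize_frames(features, edge_factor=2.0):
    """features: (n_frames, d) array of per-frame feature vectors."""
    # Pairwise distances act as the weighted "Video Similarity Graph".
    dist = squareform(pdist(features, metric="euclidean"))
    # Minimum spanning tree of the complete frame graph.
    mst = minimum_spanning_tree(dist).toarray()
    # Cut overly long ("inadmissible") edges to split the tree into clusters.
    edge_weights = mst[mst > 0]
    mst[mst > edge_factor * edge_weights.mean()] = 0
    # Remaining connected components are the frame clusters.
    n_clusters, labels = connected_components(mst, directed=False)
    # For each cluster, keep the frame closest to the cluster centroid.
    keyframes = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        centroid = features[idx].mean(axis=0)
        keyframes.append(idx[np.argmin(np.linalg.norm(features[idx] - centroid, axis=1))])
    return sorted(keyframes)

# Usage example with random 64-dimensional descriptors for 200 frames.
rng = np.random.default_rng(0)
print(summarize_frames(rng.normal(size=(200, 64))))
```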
