Abstract

Key-frame extraction for first-person vision (FPV) videos is a core technology for selecting important scenes and preserving memorable moments from our daily activities. The main difficulty in selecting key frames is the scene instability caused by the head-mounted cameras used to capture FPV videos. Because head-mounted cameras tend to shake frequently, the frames in an FPV video are noisier than those in a third-person vision (TPV) video. However, most existing algorithms for key-frame extraction mainly focus on handling the stable scenes in TPV videos, and key-frame extraction techniques for noisy FPV videos remain immature. Moreover, most key-frame extraction algorithms rely only on visual information from FPV videos, even though our visual experience in daily activities is closely tied to human motion. To capture the dynamically changing scenes in FPV videos, it is essential to integrate motion with visual information. In this paper, we propose a novel key-frame extraction method for FPV videos that uses multi-modal sensor signals to reduce noise and detect salient activities by projecting the signals onto a common space with canonical correlation analysis (CCA). We show that the two proposed multi-sensor integration models for key-frame extraction (a sparse-based model and a graph-based model) work well on the common space. The experimental results obtained using various datasets suggest that the proposed key-frame extraction techniques improve the precision of extraction and the coverage of entire video sequences.
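As a rough illustration of the CCA-based common-space projection described above, the following sketch (not the authors' implementation) uses scikit-learn's CCA to embed per-frame visual features and motion-sensor features into a shared space. The feature dimensions, the number of canonical components, and the averaging of the two projected views are assumptions made for the example.

```python
# Minimal sketch of projecting visual and motion features onto a common space
# with CCA. Feature dimensions and the number of components are illustrative.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames = 500
visual_feats = rng.standard_normal((n_frames, 128))  # e.g., per-frame appearance features
motion_feats = rng.standard_normal((n_frames, 12))   # e.g., accelerometer/gyroscope statistics

cca = CCA(n_components=8)                  # dimensionality of the common space (assumed)
cca.fit(visual_feats, motion_feats)
vis_proj, mot_proj = cca.transform(visual_feats, motion_feats)

# One common-space representation per frame, here taken as the mean of the two views.
common_space = 0.5 * (vis_proj + mot_proj)
print(common_space.shape)                  # (500, 8)
```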

Highlights

  • First-person vision (FPV) videos captured by head-mounted wearable cameras are useful for understanding daily life activities [1], [2]

  • Unconstrained FPV videos often contain insignificant objects, such as a ceiling or a floor, whereas third-person vision (TPV) videos record experiences worth remembering through manual operation that focuses on specific interesting scenes

  • We show that the proposed multi-sensor integration is effective for key-frame extraction from FPV videos under both sparse-based and graph-based models


Summary

INTRODUCTION

First-person vision (FPV) videos captured by head-mounted wearable cameras are useful for understanding daily life activities [1], [2]. We present a key-frame extraction method for FPV videos that uses multi-sensor information beyond video frames, while most existing methods use only video information [3]–[19]. We assume that motion information expresses the detailed hand or head movement that visual information does not capture. To associate their features, we embed multi-sensor data into a common vector space [20]–[27] using probabilistic canonical correlation analysis (PCCA) [28]. We show that the proposed multi-sensor integration is effective for key-frame extraction from FPV videos under both sparse-based and graph-based models, and that it can improve key-frame extraction performance across different methods. We also expand the experimental results by adding more videos to the dataset used in our conference papers, introducing another new dataset, and providing quantitative comparisons with existing methods.
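The paper performs key-frame selection on this common space through sparse-based and factor-graph-based models. As a much simpler stand-in, the sketch below clusters the common-space features and picks the frame nearest each centroid; the function name, the use of k-means, and the number of key frames are illustrative assumptions, not the paper's method.

```python
# Illustrative stand-in for key-frame selection on the common space:
# cluster common-space features and take the frame closest to each centroid.
# This is NOT the paper's sparse-based or factor-graph-based model.
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(common_space: np.ndarray, n_key_frames: int = 10) -> np.ndarray:
    """Return indices of the frames nearest to each cluster centroid."""
    km = KMeans(n_clusters=n_key_frames, n_init=10, random_state=0).fit(common_space)
    key_idx = []
    for centroid in km.cluster_centers_:
        dists = np.linalg.norm(common_space - centroid, axis=1)
        key_idx.append(int(np.argmin(dists)))
    return np.array(sorted(set(key_idx)))

# Usage with the `common_space` array from the previous sketch:
# key_frames = select_key_frames(common_space, n_key_frames=10)
```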

RELATED WORKS
PROJECTION WITH MULTI-SENSOR INTEGRATION
FACTOR-GRAPH-BASED KEY-FRAME EXTRACTION
EXPERIMENTAL SETTINGS
METRICS
SPARSE MODEL
CONCLUSION
