Abstract

Recently, deep learning techniques have contributed to solving a multitude of computer vision tasks. In this paper, we propose a deep-learning approach for video alignment, which involves finding the best correspondences between two overlapping videos. We formulate the video alignment task as a variant of the well-known machine comprehension (MC) task in natural language processing. While MC answers a question about a given paragraph, our technique determines the frame sequence in the context video that is most relevant to the query video. This is done by representing the individual frames of the two videos with highly discriminative and compact descriptors. Next, the descriptors are fed into a multi-stage network that, with the help of the bidirectional attention flow mechanism, represents the context video at various granularity levels and estimates the query-aware part of the context. The proposed model was trained on 10k video pairs collected from YouTube. The obtained results show that our model outperforms all known state-of-the-art techniques by a considerable margin, confirming its efficacy.

Highlights

  • Video Alignment refers to identifying the best correspondences, in both the spatial and temporal aspects, between two given videos

  • The prediction of the end index relies on the context weighted by the predicted start probabilities, although this is not the only parameter affecting the prediction

  • We demonstrate the power of the bi-directional attention flow mechanism, the core of the proposed model, by ablating its main elements; for example, when the attended query vector U:m in the context-to-query attention is replaced by the average of the contextual query vectors, retrieval precision drops by more than 6%
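The context-to-query attention ablated in the last highlight can be sketched roughly as follows. This is a minimal NumPy illustration of BiDAF-style attention over frame descriptors; the shapes, names, and the dot-product similarity are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_to_query_attention(H, U):
    """Context-to-query attention (BiDAF style), sketched with a plain
    dot-product similarity.

    H: (T, d) context-frame descriptors
    U: (J, d) query-frame descriptors
    Returns U_att: (T, d), one attended query vector per context frame.
    """
    S = H @ U.T             # (T, J) frame-to-frame similarity matrix
    a = softmax(S, axis=1)  # attention weights over the query frames
    return a @ U            # (T, d) attended query vectors

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))  # 5 context frames, 4-dim descriptors
U = rng.standard_normal((3, 4))  # 3 query frames
U_att = context_to_query_attention(H, U)
# The ablation replaces every attended row with the plain average:
U_avg = np.broadcast_to(U.mean(axis=0), U_att.shape)
print(U_att.shape)  # (5, 4)
```

The ablated variant discards the per-frame attention weights, which is what the reported 6% precision drop measures.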


Summary

Introduction

Video alignment refers to identifying the best correspondences, in both the spatial and temporal aspects, between two given videos. Video synchronization aims at mapping each frame in the input sequence to the most similar one in the reference sequence, taking their temporal order into consideration [1]. The video alignment task has contributed to many computer vision applications. The majority of existing alignment techniques impose restrictions such as requiring the cameras to be rigidly connected [7], [8], needing to track feature points along the whole videos [9], or assuming the temporal relation between the two videos to be constant [7]. We aim to relax these constraints to cope better with real practical applications.
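The synchronization notion above, mapping each input frame to a similar reference frame while preserving temporal order, can be illustrated with a minimal dynamic-programming sketch over per-frame descriptor distances. The Euclidean cost and toy descriptors are placeholder assumptions for illustration, not the method proposed in the paper:

```python
import numpy as np

def align_monotonic(X, Y):
    """Map each frame of X to a frame of Y, preserving temporal order.

    X: (N, d) input-video descriptors, Y: (M, d) reference descriptors.
    Returns a list `path` of length N where path[i] is the reference
    frame matched to input frame i, with path non-decreasing.
    """
    N = len(X)
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # (N, M)
    D = np.full_like(cost, np.inf)
    D[0] = cost[0]
    # D[i, j] = best total cost of matching frames 0..i with i -> j,
    # under the monotonicity constraint path[i-1] <= path[i].
    for i in range(1, N):
        best_prev = np.minimum.accumulate(D[i - 1])  # min over j' <= j
        D[i] = cost[i] + best_prev
    # Backtrack the non-decreasing sequence of matched indices.
    path = [int(np.argmin(D[-1]))]
    for i in range(N - 2, -1, -1):
        path.append(int(np.argmin(D[i][: path[-1] + 1])))
    return path[::-1]

X = np.array([[0.0], [1.0], [2.0]])   # toy 1-D frame descriptors
Y = np.array([[0.0], [0.9], [2.1]])
print(align_monotonic(X, Y))  # [0, 1, 2]
```

A constant temporal relation, as assumed in [7], would force this mapping to be linear; the dynamic program instead allows arbitrary monotonic warps.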
