Abstract

In this paper, we study the challenge of image-to-video retrieval, which uses the query image to search relevant frames from a large collection of videos. A novel framework based on convolutional neural networks (CNNs) is proposed to perform large-scale video retrieval with low storage cost and high search efficiency. Our framework consists of the key-frame extraction algorithm and the feature aggregation strategy. Specifically, the key-frame extraction algorithm takes advantage of the clustering idea so that redundant information is removed in video data and storage cost is greatly reduced. The feature aggregation strategy adopts average pooling to encode deep local convolutional features followed by coarse-to-fine retrieval, which allows rapid retrieval in the large-scale video database. The results from extensive experiments on two publicly available datasets demonstrate that the proposed method achieves superior efficiency as well as accuracy over other state-of-the-art visual search methods.
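The coarse-to-fine retrieval mentioned above can be sketched as a two-stage ranking: a coarse pass over compact shot-level descriptors narrows the candidates, and a fine pass scores individual frames only inside the top-ranked shots. The following is a minimal illustrative sketch, not the paper's exact implementation; the function name, dimensions, and use of cosine similarity on pre-normalized features are all assumptions.

```python
import numpy as np

def coarse_to_fine_search(query, shot_feats, frame_feats_per_shot, top_shots=3):
    """Coarse stage: rank shot-level descriptors against the query.
    Fine stage: score individual frames only within the top-ranked shots.
    Assumes shot_feats rows and frame features are L2-normalized."""
    q = query / np.linalg.norm(query)
    shot_scores = shot_feats @ q                     # cosine similarity per shot
    candidates = np.argsort(-shot_scores)[:top_shots]
    best = (-1.0, None, None)                        # (similarity, shot idx, frame idx)
    for s in candidates:
        scores = frame_feats_per_shot[s] @ q         # frame-level similarities
        i = int(scores.argmax())
        if scores[i] > best[0]:
            best = (float(scores[i]), int(s), i)
    return best
```

Because the coarse stage touches only one vector per shot, most frame-level comparisons are skipped, which is what makes the strategy scale to a large video database.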

Highlights

  • Enormous numbers of images and videos are generated and uploaded to the Internet

  • The well-known I2I visual search can be used for product search, in which relevant images are retrieved by the query image. The V2V search is commonly used for copyright protection, in which video clips are found via a relevant video. The I2V search addresses the problem of retrieving relevant video frames or specific timestamps from a large database via the query image. This technology is relevant for numerous applications, such as brand monitoring, searching film using slides, and searching lecture videos using screenshots

  • In Deep Feature Temporal Aggregation (DFTA), features in the same shot are aggregated into a single feature to reduce redundant information between adjacent frames
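The shot-level aggregation described in the DFTA highlight can be sketched as average pooling of per-frame features followed by L2 normalization. This is a minimal sketch under stated assumptions; the helper name and the 512-d feature size are illustrative, and the paper's exact aggregation may differ.

```python
import numpy as np

def aggregate_shot_features(frame_features):
    """Average-pool the per-frame feature vectors of one shot into a
    single shot-level descriptor, then L2-normalize it."""
    pooled = np.mean(np.asarray(frame_features, dtype=np.float32), axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Example: a shot of 5 frames, each with a 512-d CNN feature
shot = np.random.rand(5, 512).astype(np.float32)
descriptor = aggregate_shot_features(shot)
```

Averaging collapses near-duplicate adjacent frames into one vector, so each shot contributes a single descriptor regardless of its length.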

Summary

Introduction

Enormous numbers of images and videos are generated and uploaded to the Internet. With such a large amount of publicly available data, visual search has become an important frontier topic in the field of information retrieval. There exist several kinds of visual search tasks, including image-to-image (I2I) search [1, 2], video-to-video (V2V) search [3, 4], and image-to-video (I2V) search [5, 6]. The well-known I2I visual search can be used for product search, in which relevant images are retrieved by the query image. The I2V search addresses the problem of retrieving relevant video frames or specific timestamps from a large database via the query image. We study the specific task of I2V search, which is especially challenging because of the asymmetry between the query image and the video data. To perform large-scale retrieval, we should select representative frames of a video frame sequence to reduce redundant information for further processing. We propose a cluster-based key-frame extraction algorithm to summarize the video sequences.
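The clustering idea behind key-frame extraction can be sketched as follows: cluster the frame features and keep, from each cluster, the frame closest to the centroid. This is a basic k-means sketch for illustration only; the function name, the plain-NumPy k-means, and the feature dimensions are assumptions, and the paper's actual clustering algorithm may differ.

```python
import numpy as np

def extract_key_frames(features, k, iters=20, seed=0):
    """Cluster frame features with a basic k-means and return the sorted
    indices of the frames nearest each centroid as key frames."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=np.float32)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    # One representative frame per centroid (duplicates collapse via the set).
    return sorted({int(dists[:, j].argmin()) for j in range(k)})

# Example: summarize 100 frames (64-d features each) with at most 5 key frames
frames = np.random.rand(100, 64).astype(np.float32)
key_indices = extract_key_frames(frames, k=5)
```

Keeping only one frame per cluster is what removes the redundancy between visually similar frames and cuts the storage cost of the video index.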
