Abstract

With the popularization of camera-equipped devices such as mobile phones and the rise of short-video platforms, large numbers of videos are uploaded to the Internet at all times, which makes a video retrieval system with fast retrieval speed and high precision necessary. Content-based video retrieval (CBVR) has therefore attracted the interest of many researchers. A typical CBVR system contains two essential parts: video feature extraction and similarity comparison. Video feature extraction is challenging: previous retrieval methods mostly extract features from single video frames, which loses the temporal information in the videos. Hashing methods are widely used in multimedia information retrieval because of their retrieval efficiency, but most of them have so far been applied only to image retrieval. To address these problems in video retrieval, we build an end-to-end framework called deep supervised video hashing (DSVH). It employs a 3D convolutional neural network (CNN) to obtain spatio-temporal features of videos and then trains a set of hash functions, under supervision, that map the video features into a binary space and yield compact binary codes; the network is trained with a triplet loss. Extensive experiments on three public video datasets, UCF-101, JHMDB, and HMDB-51, show that the proposed method outperforms many state-of-the-art video retrieval methods. Compared with the DVH method, the mAP on UCF-101 improves by 9.3%, and the minimum improvement on JHMDB is 0.3%. We also demonstrate the stability of the algorithm on HMDB-51.
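
The abstract does not give implementation details, so the following is only a minimal sketch of the described pipeline: a generic 3D CNN backbone for spatio-temporal features, a fully connected hash layer with tanh activation that produces relaxed binary codes, and a triplet loss for training. The module names, layer sizes, hash length, and margin below are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of a DSVH-style pipeline (not the authors' exact code).
# Assumptions: a small 3D CNN backbone, a tanh hash layer producing
# `hash_bits` values in (-1, 1), and a triplet margin loss on those codes.
import torch
import torch.nn as nn


class VideoHashNet(nn.Module):
    def __init__(self, hash_bits: int = 64):
        super().__init__()
        # 3D CNN backbone: convolves over (time, height, width) to keep temporal info.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.AdaptiveAvgPool3d(1),           # global spatio-temporal pooling
        )
        # Hash layer: projects features to hash_bits and squashes them into (-1, 1).
        self.hash_layer = nn.Sequential(nn.Linear(128, hash_bits), nn.Tanh())

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, H, W)
        feats = self.backbone(clips).flatten(1)
        return self.hash_layer(feats)          # relaxed (continuous) hash codes

    @torch.no_grad()
    def binary_codes(self, clips: torch.Tensor) -> torch.Tensor:
        # Binarize the relaxed codes to {-1, +1} for retrieval.
        return torch.sign(self.forward(clips))


if __name__ == "__main__":
    net = VideoHashNet(hash_bits=64)
    loss_fn = nn.TripletMarginLoss(margin=1.0)
    # Toy triplet: anchor and positive share a class label, negative does not.
    anchor = torch.randn(2, 3, 8, 64, 64)
    positive = torch.randn(2, 3, 8, 64, 64)
    negative = torch.randn(2, 3, 8, 64, 64)
    loss = loss_fn(net(anchor), net(positive), net(negative))
    loss.backward()
    print("triplet loss:", loss.item())
```

At retrieval time, the signed codes from `binary_codes` would be compared by Hamming distance, which is what makes the hashing formulation fast compared with matching real-valued features.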

Highlights

  • In the past several years, video has been widely used because it carries richer content and is easier to understand than other media

  • The results on JHMDB are similar to those on UCF-101: our method retains a significant advantage over iterative quantization (ITQ) and DVH, and even compared with the recent BIDLSTM method it achieves a slight advantage of about 0.3–2%

  • A convolutional neural network (CNN) with stacked heterogeneous multi-kernel convolutions is used to extract frame features, and a bidirectional long short-term memory (LSTM) network is applied to preserve the temporal information (see the sketch after this list)
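
The highlight above pairs per-frame CNN features with a bidirectional LSTM. Since the heterogeneous multi-kernel design is not detailed here, the sketch below substitutes a plain small frame encoder and uses PyTorch's `nn.LSTM` with `bidirectional=True`; all class names and dimensions are illustrative assumptions.

```python
# Illustrative sketch: per-frame CNN features fed to a bidirectional LSTM.
# A plain CNN stands in for the stacked heterogeneous multi-kernel encoder.
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Encodes one RGB frame into a fixed-length feature vector."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, 3, H, W) -> (N, feat_dim)
        return self.fc(self.conv(frames).flatten(1))


class CnnBiLstm(nn.Module):
    """Runs the frame encoder over a clip, then a BiLSTM over the frame sequence."""
    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim)
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        frame_feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        seq_out, _ = self.bilstm(frame_feats)   # (batch, frames, 2 * hidden)
        return seq_out.mean(dim=1)              # temporal average -> video descriptor


if __name__ == "__main__":
    video = torch.randn(2, 16, 3, 64, 64)       # 2 clips of 16 frames each
    print(CnnBiLstm()(video).shape)             # torch.Size([2, 256])
```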


Summary

A Supervised Video Hashing Method

Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food; Major of Electrical Engineering and Electronics, Graduate School of Engineering, Kogakuin University, Tokyo 163-8677, Japan.

Introduction
3D Convolutional Neural Network
Hashing
Video Retrieval
Proposed Approach
Frames Selection
Feature Extraction
Hash Layer
Loss Function
Triplet Selection
Experiments
Datasets and Pre-Trained Model
Experimental Settings
Evaluation Metrics
Experimental Results
Experimental Results on UCF-101
Experimental Results on JHMDB
Experimental Results on HMDB-51
Conclusions