Abstract

Recently, large-scale video content retrieval has gained attention due to the large amount of user-generated image and video content available on the internet. Hashing is an effective technique that encodes high-dimensional feature vectors into compact binary codes. The aim of hashing is to generate short binary codes that map similar videos to nearby hash values, so that similar videos can be retrieved from the database with a minimum distance measure. Deep learning-based hashing networks are employed to learn representative video features and estimate the hash functions. However, existing hashing approaches rely on frame-level feature representations and do not fully exploit the temporal features that are effective for visual search. Furthermore, significant loss of features during the dimensionality reduction step causes low accuracy. Hence, it is essential to develop a deep learning-based hashing framework that exploits both strong spatial and temporal features for scalable video search. The main objective is to learn high-dimensional features from the entire video and derive compact binary codes to retrieve the videos most similar to the input query sequence. In this paper, we propose a joint supervised stacked heterogeneous convolutional multi-kernel (Stacked HetConv-MK)-bidirectional Long Short-Term Memory (BiDLSTM) network model that effectively encodes both rich structural and discriminative features from the video sequence to estimate compact binary codes. First, the video frames are passed to stacked convolution networks with heterogeneous convolutional kernel sizes and residual learning, which extract spatial features at different views from the video sequences and improve learning efficiency. Then, the bidirectional network processes the sequence in both forward and backward directions and produces a series of hidden-state outputs. Finally, a fully connected structure with an activation unit performs hashing to learn multiple binary codes for each video. Experimental analysis is performed on three datasets, and the results show better accuracy than other state-of-the-art approaches.
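To make the described pipeline concrete, below is a minimal PyTorch sketch of the architecture outlined in the abstract: heterogeneous multi-kernel convolution blocks with residual learning for spatial features, a BiLSTM over the frame sequence for temporal features, and a fully connected hashing layer with an activation unit. All specifics (the kernel sizes 1/3/5, channel widths, the 48-bit code length, mean pooling over the BiLSTM hidden states, and the tanh relaxation with sign binarization) are illustrative assumptions, since the abstract does not give the exact configuration.

```python
import torch
import torch.nn as nn

class HetConvBlock(nn.Module):
    """Heterogeneous multi-kernel convolution block with a residual connection.
    Kernel sizes and channel widths are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)
        # Parallel branches, one per kernel size, padded to keep spatial dims.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2)
            for k in kernel_sizes
        ])
        # 1x1 projection so the residual matches the concatenated output width.
        self.proj = nn.Conv2d(in_ch, branch_ch * len(kernel_sizes), 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.relu(out + self.proj(x))  # residual learning

class VideoHashNet(nn.Module):
    """Sketch of the joint Stacked HetConv-MK + BiLSTM hashing pipeline."""
    def __init__(self, hash_bits=48, hidden=256):
        super().__init__()
        self.spatial = nn.Sequential(
            HetConvBlock(3, 96), nn.MaxPool2d(2),
            HetConvBlock(96, 192), nn.AdaptiveAvgPool2d(1),
        )
        self.bilstm = nn.LSTM(192, hidden, batch_first=True,
                              bidirectional=True)
        self.hash_fc = nn.Linear(2 * hidden, hash_bits)

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        # Per-frame spatial features from the stacked HetConv blocks.
        feats = self.spatial(frames.flatten(0, 1)).flatten(1)  # (B*T, 192)
        feats = feats.view(b, t, -1)
        # Forward and backward hidden states over the frame sequence.
        hidden, _ = self.bilstm(feats)  # (B, T, 2*hidden)
        # tanh gives relaxed codes in (-1, 1) for training.
        return torch.tanh(self.hash_fc(hidden.mean(dim=1)))

# Usage with illustrative sizes: two 16-frame clips at 112x112 resolution.
model = VideoHashNet()
clip = torch.randn(2, 16, 3, 112, 112)
binary_codes = torch.sign(model(clip))  # (2, 48), values in {-1, +1}
```

At inference, the sign function binarizes the relaxed tanh outputs, and retrieval reduces to ranking database videos by Hamming distance between these codes; this training-time relaxation is a common choice for hashing networks, not a detail confirmed by the abstract.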
