Abstract

Object tracking based on deep learning is a hot topic in computer vision with many applications. Due to high computation and memory costs, it is difficult to deploy convolutional neural networks (CNNs) for object tracking on embedded systems with limited hardware resources. In this paper, a Siamese network forms the backbone of our tracker. The convolution layers used to extract features typically incur the highest costs, so improvements should focus on them to make tracking more efficient. We optimize the standard convolution with separable convolution, which consists mainly of a depthwise convolution followed by a pointwise convolution. To further reduce computation, filters in the depthwise convolution layers are pruned according to their variance. Because weight distributions differ across convolution layers, the filter pruning is guided by a designed hyper-parameter. With these improvements, the number of parameters is reduced to 13% of the original network and the computation to 23%. On the NVIDIA Jetson TX2, tracking speed increases by a factor of 3.65 on the CPU and 2.08 on the GPU, without significant degradation of tracking performance on the VOT benchmark.
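The separable factorization replaces each k × k standard convolution with a depthwise convolution (one k × k filter per input channel) followed by a 1 × 1 pointwise convolution. A minimal sketch of the resulting parameter savings follows; the layer shape below is illustrative and not taken from the paper:

```python
# Sketch (not the paper's code): parameter counts for a standard
# convolution versus its depthwise-separable factorization.
def standard_conv_params(k, c_in, c_out):
    # each of the c_out filters spans all c_in input channels
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 filters that mix channels
    return depthwise + pointwise

# Hypothetical layer shape, chosen only for illustration:
k, c_in, c_out = 3, 128, 256
ratio = separable_conv_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)
print(f"separable/standard parameter ratio: {ratio:.3f}")
```

For a 3 × 3 layer the factorization cuts parameters by roughly an order of magnitude, which is consistent with the overall 13% parameter figure reported above once filter pruning is applied as well.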

Highlights

  • Visual object tracking tasks predict the object region in the subsequent frames when its size and position are given in the first video frame

  • With the application of deep learning in object tracking, the millions of parameters and huge computation of convolutional neural networks (CNNs) pose a challenge for deployment on resource-limited hardware

  • The standard CNN is improved by separable convolution and filter pruning


Summary

INTRODUCTION

Visual object tracking tasks predict the object region in subsequent frames when its size and position are given in the first video frame. Convolution layers in deep neural networks extract the features of the object region and of each video frame, and this process incurs most of the parameters and calculations in tracking networks. In trained models, filters in the depthwise convolution layers with low variance are assumed to contribute little to feature extraction, so they can be pruned to further reduce network size and computation. In the subsequent 1 × 1 pointwise convolution, the number of channels in each pointwise filter shrinks accordingly, because the pruned depthwise filters no longer produce input feature maps.
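The pruning step above can be sketched as follows. This is our illustration, not the paper's code: the exact pruning criterion and the form of the layer-wise hyper-parameter (here a scale factor `alpha` on the mean variance) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

c_in, k = 8, 3
depthwise = rng.normal(size=(c_in, k, k))      # one k x k filter per input channel
pointwise = rng.normal(size=(16, c_in, 1, 1))  # 16 pointwise filters, c_in channels each

# Per-filter variance; low-variance depthwise filters are assumed to
# contribute little to feature extraction.
variances = depthwise.reshape(c_in, -1).var(axis=1)

# A layer-wise hyper-parameter scales the threshold to the layer's own
# weight distribution (assumed form, for illustration only).
alpha = 0.8
keep = variances >= alpha * variances.mean()

depthwise_pruned = depthwise[keep]
# The pointwise filters lose the matching input channels, since the
# pruned depthwise filters no longer produce those feature maps.
pointwise_pruned = pointwise[:, keep]

print(depthwise_pruned.shape, pointwise_pruned.shape)
```

Note that pruning one depthwise filter removes a channel from every pointwise filter, so both convolutions shrink together.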

