Visual object tracking poses challenges due to deformation of target object appearance, fast motion, brightness change, blocking due to obstacles, etc. In this paper, a Siamese network that is configured using a convolutional neural network is proposed to improve tracking accuracy and robustness. Object tracking accuracy is dependent on features that can well represent objects. Thus, we designed a convolutional neural network structure that can preserve feature information that is produced in the previous layer to extract spatial and semantic information. Features are extracted from the target object and search area using a Siamese network, and the extracted feature map is input into the region proposal network, where fast Fourier-transform convolution is applied. The feature map produces a probability score for the presence of an object region and an object in a region, where the similarities are high to search the target. The network was trained with a video dataset called ImageNet Large Scale Visual Recognition Challenge. In the experiment, quantitative and qualitative evaluations were conducted using the object-tracking benchmark dataset. The evaluation results indicated competitive results for some video attributes through various experiments. By conducting experiments, the proposed method achieved competitive results for some video attributes, with a success metric of 0.632 and a precision metric of 0.856 as quantitative values.
Read full abstract