Abstract

Video text in real-world scenes often carries rich high-level semantic information and plays an increasingly important role in content-based video analysis and retrieval. Scene text detection and tracking in video are therefore important prerequisites for numerous multimedia applications. However, the performance of most existing tracking methods is unsatisfactory due to frequent mis-detections, unexpected camera motion, and similar appearances among text regions. To address these problems, we propose a new video text tracking approach based on hybrid deep text detection and a layout constraint. First, a deep text detection network that combines the advantages of object detection and semantic segmentation in a hybrid way is proposed to locate candidate text regions in individual frames. Then, text trajectories are derived from consecutive frames with a novel data association method that effectively exploits the layout constraint of text regions under large camera motion. By utilizing the layout constraint, the ambiguities caused by similar text regions are effectively reduced. We evaluate the proposed method on four benchmark datasets: ICDAR 2015, MSRA-TD500, USTB-SV1K, and Minetto. The experimental results demonstrate the effectiveness and superiority of the proposed approach.
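To illustrate the layout-constrained data association step, the sketch below matches text detections between two consecutive frames by combining a positional cost with a pairwise-layout consistency cost and solving the resulting assignment with the Hungarian algorithm. This is a minimal sketch, not the authors' implementation: the box format, the weights alpha and beta, and the helper names pairwise_layout and associate are assumptions for illustration only.

    # Minimal sketch (not the authors' exact method) of layout-constrained
    # data association between text detections in two consecutive frames.
    # Boxes are (cx, cy, w, h); all names and weights are illustrative.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def pairwise_layout(boxes):
        """Offset vectors between every pair of box centers (the 'layout')."""
        c = boxes[:, :2]
        return c[None, :, :] - c[:, None, :]        # shape (n, n, 2)

    def associate(prev_boxes, curr_boxes, alpha=1.0, beta=0.5):
        """Match boxes across frames by position plus layout consistency."""
        prev_boxes = np.asarray(prev_boxes, float)
        curr_boxes = np.asarray(curr_boxes, float)
        n, m = len(prev_boxes), len(curr_boxes)
        # Positional cost: center distance, normalized by box size.
        d = np.linalg.norm(prev_boxes[:, None, :2] - curr_boxes[None, :, :2],
                           axis=-1)
        scale = prev_boxes[:, 2:4].mean(axis=1)[:, None] + 1e-6
        pos_cost = d / scale
        # Layout cost: a tentative match (i, j) is cheap if the offsets from
        # box i to its neighbors in the previous frame resemble the offsets
        # from box j to the detections in the current frame. A global camera
        # translation cancels in these pairwise offsets, so the cost stays
        # robust under large camera motion.
        L_prev, L_curr = pairwise_layout(prev_boxes), pairwise_layout(curr_boxes)
        layout_cost = np.zeros((n, m))
        for i in range(n):
            for j in range(m):
                diffs = np.linalg.norm(
                    L_prev[i][:, None, :] - L_curr[j][None, :, :], axis=-1)
                # For each previous neighbor, distance to its closest current
                # counterpart, averaged over all neighbors.
                layout_cost[i, j] = diffs.min(axis=1).mean()
        cost = alpha * pos_cost + beta * layout_cost / scale
        row, col = linear_sum_assignment(cost)      # Hungarian assignment
        return list(zip(row.tolist(), col.tolist()))

    if __name__ == "__main__":
        prev = [(50, 40, 30, 10), (120, 40, 30, 10)]
        curr = [(128, 48, 30, 10), (58, 48, 30, 10)]  # same layout, camera shift
        print(associate(prev, curr))                  # -> [(0, 1), (1, 0)]

Because the pairwise offsets between box centers cancel any common translation, the layout term keeps the cost small under global camera motion while still penalizing swaps between similar-looking text regions, which is the intuition behind the layout constraint described above.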
