End-to-end video subtitle recognition via a deep Residual Neural Network

Hongyu Yan, Xin Xu

doi:10.1016/j.patrec.2020.01.019

Abstract

Video subtitle recognition can substantially facilitate a wide range of applications, such as automatic video retrieval and summarization. However, current methods face great challenges in video subtitle recognition due to complex backgrounds, diverse fonts, and low contrast between text and background. In this paper, we propose an end-to-end pipeline for recognizing video subtitles. A Connectionist Text Proposal Network (CTPN) is used for video subtitle detection, while a Residual Network (ResNet), a Gated Recurrent Unit (GRU), and Connectionist Temporal Classification (CTC) are used to recognize Chinese and English subtitles in video images. Specifically, the subtitle area in a video image is first located by CTPN. The detected subtitle area is then fed into the ResNet to extract feature sequences. Next, a bidirectional GRU layer models the feature sequences. Finally, CTC is used to compute the loss and output the final result. On two public datasets, ICDAR2003 and ICDAR2013, the proposed method achieves recognition accuracies of 92.3% and 89.2%, respectively, outperforming current methods. In addition, experiments on real video subtitles show that the proposed method achieves the best performance among state-of-the-art methods.
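The final CTC step of the pipeline can be illustrated with a minimal greedy (best-path) decoding sketch. This is not the authors' implementation: the function name `ctc_greedy_decode`, the choice of index 0 as the blank symbol, and the toy alphabet are assumptions made for illustration, and the abstract does not say whether greedy or beam-search decoding is used. The input stands in for the per-frame class scores that the bidirectional GRU would produce.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding: best class per frame, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists, one score per class.
    alphabet: list mapping class index to character; alphabet[blank] is unused.
    """
    # 1. Pick the highest-scoring class for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Collapse runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blank labels and map the rest to characters.
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Toy example: 6 frames over the classes (blank, "a", "b").
toy_alphabet = ["-", "a", "b"]
toy_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two "a" characters)
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
    [0.10, 0.10, 0.80],  # b (repeat, merged)
]
decoded = ctc_greedy_decode(toy_scores, toy_alphabet)
print(decoded)
```

The blank frame between the two runs of "a" is what lets CTC emit a doubled character; without it, the repeated labels would collapse into a single "a".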

Full Text