Abstract

Although a large number of methods for video text detection and recognition have been proposed over the past years, it is hard to identify the best state-of-the-art method because standard datasets, ground truth, and common evaluation measures are not available. Therefore, in this paper, we propose a semiautomatic system for ground truth generation for video text detection and recognition, which covers English and Chinese text of different orientations. The system allows the user to manually correct the ground truth when the automatic method produces incorrect results. We propose eleven attributes at the word level, namely: line index, word index, coordinate values of the bounding box, area, content, script type, orientation information, type of text (caption/scene), condition of text (distortion/distortion-free), start frame, and end frame, to evaluate the performance of a method. We also introduce a new dataset that consists of 466 video frames collected from the TRECVID 2005 and 2006 databases. The video frames in our dataset contain both horizontal texts (278 frames: 181 with English texts and 97 with Chinese texts) and nonhorizontal texts (188 frames: 140 English and 48 Chinese). Furthermore, the performance of the proposed system is compared with existing text detection methods by calculating measures both manually and automatically to show the usefulness of our semiautomatic system. The ground truth and the semiautomatic system will be released to the public.
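The eleven word-level attributes listed above can be sketched as a single annotation record. The following Python sketch is illustrative only: the field names, types, and bounding-box convention are assumptions, not the authors' actual ground-truth file format.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class WordGroundTruth:
    """Hypothetical word-level ground-truth record with the eleven
    attributes named in the abstract (names/types are assumptions)."""
    line_index: int                           # text line the word belongs to
    word_index: int                           # position of the word in the line
    bounding_box: Tuple[int, int, int, int]   # assumed (x, y, width, height) in pixels
    area: int                                 # area of the bounding box in pixels
    content: str                              # transcription of the word
    script: str                               # script type, e.g. "English" or "Chinese"
    orientation: str                          # e.g. "horizontal" or "nonhorizontal"
    text_type: str                            # "caption" or "scene"
    condition: str                            # "distortion" or "distortion-free"
    start_frame: int                          # first frame in which the word appears
    end_frame: int                            # last frame in which the word appears

# Example record for a horizontal English caption word (values invented).
word = WordGroundTruth(
    line_index=0,
    word_index=2,
    bounding_box=(120, 45, 60, 18),
    area=60 * 18,
    content="NEWS",
    script="English",
    orientation="horizontal",
    text_type="caption",
    condition="distortion-free",
    start_frame=1510,
    end_frame=1623,
)
```

A record like this makes evaluation straightforward: detections can be matched against `bounding_box` and `orientation`, while `start_frame`/`end_frame` support tracking-level measures across consecutive frames.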
