Abstract

Text detection has been significantly boosted by the development of deep neural networks, but most existing methods focus on a single kind of text instance (e.g., overlaid text, layered text, or scene text). In this paper, we expand the text detection task from a single dimension to multiple dimensions, thus providing multi-type text descriptions for the scene and content analysis of videos. Specifically, we establish a new task, termed TextDC, that detects and classifies text instances simultaneously. As far as we know, existing benchmarks cannot meet the requirements of the proposed task. To this end, we collect a large-scale text detection and classification dataset, named Text3C, which is annotated with multilingual labels, location information, and text categories. Together with the collected dataset, we introduce a multi-stage and strict evaluation metric, which penalizes detection approaches for missing text instances, false-positive detections, inaccurate location boxes, and incorrect text categories, establishing a new benchmark for the proposed TextDC task. In addition, we extend several state-of-the-art detectors to the new task by modifying their prediction heads. We then design and formulate a generalized text detection and classification framework. Extensive experiments with the updated methods are conducted on the established benchmark to verify the solvability of the proposed task, the challenges of the dataset, and the effectiveness of the solution.
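The abstract describes a metric that penalizes four failure modes: missed instances, false positives, inaccurate boxes, and wrong categories. The toy sketch below illustrates how such a joint detection-and-classification score could be computed, assuming axis-aligned boxes, greedy IoU matching, and an F1 summary; the function names, the IoU threshold, and the matching scheme are illustrative assumptions, not the paper's actual metric.

```python
# Hypothetical sketch of a TextDC-style metric: a prediction counts as a
# true positive only if its box overlaps an unmatched ground-truth box
# (IoU >= thr) AND its predicted category matches. Missed instances lower
# recall; false positives, bad boxes, and wrong categories lower precision.

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def textdc_f1(preds, gts, thr=0.5):
    # preds / gts: lists of (box, category) pairs.
    matched = set()
    tp = 0
    for pbox, pcat in preds:
        for i, (gbox, gcat) in enumerate(gts):
            if i in matched:
                continue
            # A box-accurate but mis-categorized prediction is still an error.
            if iou(pbox, gbox) >= thr and pcat == gcat:
                matched.add(i)
                tp += 1
                break
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, a prediction whose box perfectly overlaps a ground-truth instance but carries the wrong category label contributes nothing to the score, which is what distinguishes this joint metric from a detection-only IoU evaluation.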
