Text Detection Model for Historical Documents Using CNN and MSER

Xiaogang Qiu, Rankang Li, Fujia Zhao, Shanxiong Chen

doi:10.4018/jdm.322086

Abstract

This article introduces a text detection model for historical document images. Handwritten characters in historical documents are often difficult to detect because of faded or missing ink, weathering, and stains, all of which can seriously degrade detection accuracy. To reduce these effects, an ATD model is proposed to detect the text boxes of characters in historical document images. The ATD model comprises a CNN-based text-box generation network and an NMS-based MSER text-box generation model. As a post-processing step, a text merging algorithm is proposed to further improve detection accuracy. Test results on historical document datasets, including Yi, English, Latin, and Italian datasets, show that the proposed method achieves good accuracy and represents a solid step toward text detection in historical documents.
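
The abstract names non-maximum suppression (NMS) as the filter applied to MSER candidate boxes but does not specify the variant. As an illustration only, the sketch below shows the standard greedy, IoU-based form of NMS. The function names, the `(x1, y1, x2, y2)` box layout, the per-box scores, and the 0.5 threshold are assumptions made for this example and are not taken from the article.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    The threshold value is a hypothetical default, not the paper's setting.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping candidates and one separate candidate.
candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
confidences = [0.9, 0.8, 0.7]
kept = nms(candidates, confidences)
```

Here the second candidate overlaps the first with an IoU of about 0.68, so it is suppressed, while the third candidate does not overlap and is kept.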

Full Text