A Sequence Labeling Based Approach for Character Segmentation of Historical Documents

Liangcai Gao, Xiaode Zhang, Yaoxiong Huang, Lianwen Jin, Zhi Tang

doi:10.1109/das.2018.16

Abstract

Character segmentation is an important prerequisite step of historical document image analysis, and it is both fundamental and challenging. In this paper, we propose a novel approach to handwritten character segmentation of historical documents that treats the task as a sequence labeling problem. Specifically, the proposed model first segments the document image into lines; each column of a line image is then given a label indicating whether or not it is a segmentation position. The labeling is performed by a neural model that combines a CNN for feature extraction, an LSTM for sequence modeling, and a CRF for sequence labeling. We evaluate our methods on a 300-page dataset containing 96,479 characters. The experimental results demonstrate that the proposed methods achieve superior or highly competitive performance compared with other methods.
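To make the column-labeling formulation concrete, here is a minimal sketch of how per-column labels could be turned into character spans. It covers only a post-processing step, not the CNN, LSTM, or CRF. The label convention (1 for a segmentation position, 0 otherwise), the function name `labels_to_segments`, and the example label sequence are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def labels_to_segments(labels: List[int]) -> List[Tuple[int, int]]:
    """Convert per-column labels into half-open (start, end) column spans.

    Assumed convention (hypothetical): 1 marks a segmentation position,
    0 marks a column that belongs to a character. Each maximal run of
    0-labeled columns is returned as one span.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # first column of the run currently being tracked
    for i, label in enumerate(labels):
        if label == 0 and start is None:
            start = i  # a new character run begins here
        elif label == 1 and start is not None:
            segments.append((start, i))  # the run ends before column i
            start = None
    if start is not None:
        # The line image ended while still inside a character run.
        segments.append((start, len(labels)))
    return segments


# Illustrative labels for a 10-column line image (made-up values).
column_labels = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
print(labels_to_segments(column_labels))  # [(1, 4), (6, 8), (9, 10)]
```

Each returned span could then be used to crop one character from the line image. A real system would likely also need to handle noisy labels, for example by merging very short runs.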

Full Text