Abstract

Nom is an ancient script used in Vietnam until the current Latin-based Vietnamese alphabet became common, and a large number of ancient Nom documents are in existence. Due to the gradual degradation of Nom documents and a decrease in the number of scholars who can understand them, a system to digitalize Nom documents is urgently necessary. This paper presents a segmentation-based method for digitalizing Nom documents using deep convolution neural networks. Nom pages are preprocessed, segmented into isolated characters, and then recognized by a single-character OCR. The structure of the U-Net is applied to create segmentation maps and extract character regions from them. Subsequently, we propose coarse and fine combined classifiers to recognize each character pattern. The results by the best classifier are revised by a decoder using a langue model. The decoder is the same as the connectionist temporal classification decoder used in end-to-end text recognition systems. Compared with the traditional segmentation method using projection profiles and the Voronoi diagram (IoU = 81.23%), the segmentation method using the deep convolution neural network produces a better result (IoU = 92.08%) for detecting character regions. The proposed CNN models for recognizing segmented character patterns outperforms the traditional models using the modified quadratic discriminant function and the learning vector quantization with the recognition rate of 85.07%. The combination of coarse and fine classifiers, the training dataset with salt and pepper noises, and the attention layer are the key factors in the recognition rate improvement.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.