Abstract

In this paper, we introduce an end-to-end Amharic text-line image recognition approach based on recurrent neural networks. Amharic is an indigenous Ethiopic script that follows a unique syllabic writing system adopted from the ancient Geez script. This script uses 34 consonant characters with seven vowel variants of each (called basic characters), plus labialized characters derived by adding diacritical marks to and/or removing parts of the basic characters. The diacritics attached to basic characters are relatively small and visually similar, which makes the derived characters hard to distinguish from one another. Motivated by the recent success of end-to-end learning in pattern recognition, we propose a model that integrates a feature extractor, a sequence learner, and a transcriber in a unified module and is trained in an end-to-end fashion. The experimental results on the printed and synthetic text-line images of a benchmark Amharic Optical Character Recognition (OCR) database called ADOCR demonstrate that the proposed model outperforms state-of-the-art methods by 6.98% and 1.05%, respectively.

Highlights

  • Amharic is the second-largest Semitic dialect in the world after Arabic [1]

  • This paper is an extension of the previous work [9] with the following summarized contributions: (1) To extract features automatically from text-line images, we propose a Convolutional Neural Network (CNN)-based feature extractor module

  • To select suitable network parameters, different values were considered and tuned during experimentation. The results reported in this paper were obtained with the Adam optimizer and a learning rate of 0.001, using a convolutional neural network whose feature maps start at 64 and increase to 512, and a Bidirectional LSTM (BLSTM) network with two hidden layers of size 128 each
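
The settings quoted in the last highlight can be collected in a small configuration object. The following is only a sketch under stated assumptions, not the authors' code: the class and field names are hypothetical, the feature-map progression assumes the width doubles at each convolutional stage (the text gives only the endpoints 64 and 512), and the size of 128 is assumed to be per direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizerConfig:
    """Hyperparameters quoted in the text; all names here are illustrative."""
    cnn_first_maps: int = 64      # feature maps in the first conv stage
    cnn_last_maps: int = 512      # feature maps in the last conv stage
    blstm_layers: int = 2         # number of BLSTM hidden layers
    blstm_hidden: int = 128       # hidden size (ASSUMED to be per direction)
    optimizer: str = "adam"
    learning_rate: float = 0.001

    def cnn_feature_maps(self):
        """Feature-map counts per stage, ASSUMING the width doubles each stage."""
        widths = [self.cnn_first_maps]
        while widths[-1] < self.cnn_last_maps:
            widths.append(min(widths[-1] * 2, self.cnn_last_maps))
        return widths

    def blstm_output_size(self):
        """A bidirectional layer concatenates forward and backward states."""
        return 2 * self.blstm_hidden


config = RecognizerConfig()
print(config.cnn_feature_maps())
print(config.blstm_output_size())
```

Under the doubling assumption the convolutional stages would have 64, 128, 256, and 512 feature maps, and each bidirectional layer would emit 256 features per time step. Keeping these values in one immutable object makes it easier to repeat the tuning described above with different settings.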


Summary

Introduction

Amharic is the second-largest Semitic dialect in the world after Arabic [1]. It is an official working language of the Federal Democratic Republic of Ethiopia and is spoken by more than 50 million people as their mother language and by over 100 million as a second language in the country [2,3]. In addition to Ethiopia, it is spoken in other countries like Eritrea, the USA, Israel, Sweden, Somalia, and Djibouti [1,4]. Dating back to the 12th century, many historical and literary documents in Ethiopia are written and documented using Amharic script. Amharic is a syllabic writing system derived from an ancient script called Geez, and it has been used extensively in all government and non-government sectors in Ethiopia until today. Amharic took all of the symbols in Geez and added some new ones that represent sounds not found in Geez [5].

