Amharic Character Recognition Based on Features Extracted by CNN and Auto-Encoder Models

Efrem Yohannes Obsie, Hongchun Qu, Qingqin Huang

doi:10.1145/3474963.3474972

Abstract

Amharic is an ancient Semitic language that serves as the official language of the Federal Republic of Ethiopia. Because a large number of historical and literary documents are written in this language, an automated OCR system is in high demand. However, previous approaches have relied on traditional machine learning algorithms with hand-crafted feature extraction, and their performance is greatly affected by the large set of structurally similar characters. Various studies on Amharic character recognition suggest that this problem can be addressed by examining more robust feature extraction techniques. In this study, we propose a hybrid method that uses two deep learning models, a Convolutional Neural Network (CNN) and a Convolutional Auto-Encoder (CAE), for feature extraction; Random Forest (RF) and Mutual Information (MI) feature selection methods for selecting the top features; and a traditional machine learning algorithm, the Support Vector Machine (SVM), for classification. First, the features extracted by the two deep models are combined to form hybrid features, and the top features are then selected by applying each feature selection method. The features selected by both methods (their intersection) are then used for recognition by the SVM. In experiments, CNN-extracted features achieved an accuracy of 96.03% and CAE-extracted features achieved 92.52%, whereas the proposed method, based on the intersection of the features selected by RF and MI, achieved 97.06%.
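The selection-and-classification stage described in the abstract can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the arrays `cnn_feats` and `cae_feats` are synthetic stand-ins for features that trained CNN and CAE models would produce, and the top-k cutoff, the number of trees, the RBF kernel, and the helper names `top_k_indices` and `select_common_features` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def top_k_indices(scores, k):
    """Return the indices of the k highest-scoring features as a set."""
    return set(np.argsort(scores)[::-1][:k].tolist())


def select_common_features(X, y, k, seed=0):
    """Rank features with RF importance and with MI, then intersect the top-k sets."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    rf_top = top_k_indices(rf.feature_importances_, k)
    mi_top = top_k_indices(mutual_info_classif(X, y, random_state=seed), k)
    return np.array(sorted(rf_top & mi_top)), rf_top, mi_top


# Synthetic stand-ins (assumption): in the paper these would be the
# feature vectors extracted by the trained CNN and CAE models.
X_all, y = make_classification(
    n_samples=600, n_features=40, n_informative=12, n_redundant=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
cnn_feats, cae_feats = X_all[:, :20], X_all[:, 20:]

# Step 1: concatenate the two feature sets into hybrid features.
hybrid = np.hstack([cnn_feats, cae_feats])

X_tr, X_te, y_tr, y_te = train_test_split(
    hybrid, y, test_size=0.25, stratify=y, random_state=0
)

# Step 2: keep only features ranked in the top k by BOTH selectors.
# k=25 of 40 is an arbitrary choice that guarantees a non-empty intersection.
common, rf_top, mi_top = select_common_features(X_tr, y_tr, k=25)

# Step 3: classify with an SVM on the common features.
scaler = StandardScaler().fit(X_tr[:, common])
svm = SVC(kernel="rbf").fit(scaler.transform(X_tr[:, common]), y_tr)
accuracy = svm.score(scaler.transform(X_te[:, common]), y_te)

print(f"{len(common)} common features, test accuracy = {accuracy:.3f}")
```

In this sketch, feature selection is fitted on the training split only so that the held-out accuracy is not influenced by the test labels; whether the paper follows the same protocol is not stated in the abstract.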

Full Text