Multilingual Optical Character Recognition (OCR) is difficult to develop as different languages exhibit different writing and structural characteristics and it is very difficult to generalize their segmentation process. Character segmentation plays an important role in developing OCR for handwritten languages. The exactness of character segmentation is the integral factor of OCR. In this paper, we exploit this limitation and propose a approach based on the polygonal approximation of the word, which works on more than one Indian languages. This work depicts the novel approach for script independent character segmentation of handwritten text utilizing basic structural properties of the languages. Digitally straight line segments (DSS) of the word is obtained by applying Polygonal approximation to the word. The segmentation of character is language independent and works considerably with skew words as well. Experiments are carried out with four popular Indian languages, Hindi, Marathi, Punjabi, and Bangla. The average success rate for character segmentation of four languages is 90.07% which is satisfactory compared with other existing methods. We use shadow and cumulative stretch feature set with random forest, support vector machine (SVM), multi-layer perceptron (MLP), and convolutional neural network (CNN) classifiers for character recognition. On experimentation, it is observed that our proposed method provided good accuracy for character segmentation and recognition.
Read full abstract