<p>Optical character recognition (OCR) for regional languages is difficult because of their complex orthographic structure, scarce dataset resources, large character sets, and structural similarity between characters. Telugu is widely spoken in the Indian states of Andhra Pradesh and Telangana. Telugu exhibits distinct separation between characters within a word, so a character-level dataset is sufficient, and a comparatively small dataset can be used to recognize a large number of words. However, challenges arise when training on compound characters, which are combinations of vowels and consonants. Such a character would otherwise be treated as two or more characters, depending on the vattus and dheerghams attached to the base character. To address this, each compound character is encoded as a numerical value that is used during training and mapped back to the character during recognition. Segmentation is a further problem, because varying handwriting styles cause characters to overlap. To handle this at the character level, we propose an algorithm based on features of the language. To improve word-level accuracy, we devised a dictionary-based model. A neural network built on the inception module extracts features at multiple scales and achieves a word-level accuracy of 78% with fewer trainable parameters.</p>
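<p>As an illustration only, the compound-character encoding and the dictionary-based correction described above might be sketched as follows. The abstract does not specify either mechanism, so the function names (<code>build_label_maps</code>, <code>decode</code>, <code>correct_word</code>), the example character units, and the use of Python's standard <code>difflib</code> module for closest-word matching are all assumptions rather than the authors' method.</p>

```python
import difflib


def build_label_maps(units):
    """Assign one integer id to each distinct character unit.

    A unit may be a simple character or a compound character (base
    character plus vattu/dheergham), so a compound is one class rather
    than several. Hypothetical helper, not taken from the paper.
    """
    unit_to_id = {u: i for i, u in enumerate(sorted(set(units)))}
    id_to_unit = {i: u for u, i in unit_to_id.items()}
    return unit_to_id, id_to_unit


def decode(ids, id_to_unit):
    """Map predicted class ids back to text during recognition."""
    return "".join(id_to_unit[i] for i in ids)


def correct_word(word, lexicon, cutoff=0.6):
    """Return the closest lexicon entry, or the word itself if none is close.

    Assumption: similarity is difflib's ratio; the paper's dictionary
    model may work differently.
    """
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Example units: two simple characters and three compounds.
units = ["క", "కా", "క్క", "మ", "మా"]
unit_to_id, id_to_unit = build_label_maps(units)

# A recognizer would emit ids; here they are looked up by hand.
predicted_ids = [unit_to_id["మా"], unit_to_id["క"]]
recognized = decode(predicted_ids, id_to_unit)
```

<p>Treating each compound as a single class keeps the label space at the character level, while the lexicon lookup runs only on the decoded word, so the two steps stay independent.</p>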