Recognising characters in text has been a popular topic in computer vision. It has many practical applications, for example recognising text in documents, classifying the text or scripts of documents, and licence plate recognition. Many researchers have developed methods for recognising characters using Optical Character Recognition (OCR). Although text recognition with OCR has been more or less solved, most of the OCR problems explored so far involve Latin-alphabet text. Meanwhile, several languages are written in non-Latin scripts. Recognising a non-Latin script is quite challenging because the contours and shapes of its characters differ considerably from those of Latin script. This research aims to collect datasets for OCR of Javanese characters. A total of 5880 characters were collected and used to train models with several methods using the Tesseract OCR tools. The models were then deployed on an Android mobile phone. The highest accuracy (97.50%) was achieved by the model that combines a single bounding box for the whole character with separate bounding boxes for the main body and the sandangan parts.