This work presents a new dataset for offline handwritten Sindhi character recognition. It contains 7,800 character images in total, organized into multiple categories and written by 150 writers of various ages, genders, and professional backgrounds. Each writer wrote the 52 Sindhi characters on a specially designed form. All completed forms were scanned with a high-quality scanner, and the individual handwritten characters were then cropped from the forms and saved in ‘.png’ format. Several preprocessing techniques were applied to remove noise from the scanned images and produce a clean dataset. The dataset is divided into a training set and a test set, with 80% of the images used for training and 20% for testing. It will be made publicly available for the benefit of the Sindhi research community. For the Sindhi language, it can be used to develop and test handwritten character recognition systems and to provide helpful insights through writer identification. The dataset created as a result of this research is the world's first openly accessible dataset for handwritten Sindhi research.
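The dataset arithmetic and the 80/20 division can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say whether the split is made per character, per writer, or at random, so the sketch assumes a split stratified by character class. The file naming scheme and the helper names `make_index` and `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict

NUM_WRITERS = 150      # stated in the abstract
NUM_CHARACTERS = 52    # stated in the abstract
TRAIN_FRACTION = 0.8   # stated in the abstract


def make_index(num_writers=NUM_WRITERS, num_characters=NUM_CHARACTERS):
    """Build (writer_id, character_id, filename) records.

    The filename pattern is a hypothetical placeholder, not the dataset's real layout.
    """
    return [
        (w, c, f"writer{w:03d}_char{c:02d}.png")
        for w in range(1, num_writers + 1)
        for c in range(1, num_characters + 1)
    ]


def stratified_split(records, train_fraction=TRAIN_FRACTION, seed=0):
    """Split records so that every character class keeps the same train/test ratio.

    Assumption: stratification by character class; the abstract does not specify this.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[record[1]].append(record)

    train, test = [], []
    for character_id in sorted(by_class):
        samples = by_class[character_id]
        rng.shuffle(samples)
        cut = round(len(samples) * train_fraction)
        train.extend(samples[:cut])
        test.extend(samples[cut:])
    return train, test


records = make_index()
train_set, test_set = stratified_split(records)
print(len(records), len(train_set), len(test_set))  # 7800 6240 1560
```

Under these assumptions each of the 52 characters contributes 120 training and 30 test images. A writer-disjoint split (120 writers for training, 30 for testing) would give the same totals and may be preferable when evaluating generalization to unseen handwriting.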