Abstract

The dataset consists of 20,000 scanned catalogues of fossils and other artifacts compiled by the Geological Sciences Department. Each image is a scanned form filled in with blue ball-point pen. Character extraction and identification is the first phase of the research; in the second phase we plan to use a Hidden Markov Model (HMM) to extract the entire text from each form and store it in digitized format. We used various image processing and computer vision techniques to extract characters from the 20,000 handwritten catalogues, including erosion, dilation, morphological transformations (morphologyEx), Canny edge detection, contour detection (findContours), and contour-area filtering (contourArea). We used the Histogram of Oriented Gradients (HOG) to extract features from the character images and applied k-means and agglomerative clustering to perform unsupervised learning, which would allow us to prepare a labelled training dataset for the second phase. We also tried converting images from RGB to CMYK to improve k-means clustering performance. In addition, we used thresholding to extract the blue-ink characters after converting each image to the HSV color space, but background noise was significant and the results obtained were not promising. We are researching a more robust extraction method that does not deform the characters and takes alignment into consideration.

