OFFLINE OCR ALGORITHM TO DETECT KURDISH/ ARABIC CHARACTERS IN SCANNED DOCUMENT

Sardar Omar Salih

doi:10.25007/ajnu.v11n3a673

Abstract

In this paper, Algorithm named (MRWL) Max Rightmost White Line is proposed to detect Kurdish/ Arabic characters’ segmentation in scanned document (printed document), it works in preprocess and segmentation stages of OCR processes, these two stages are significant parts of OCR and affects the accuracy of algorithm. The MRWL starts to remove text margins around document to reduce processing time, then, scans to find Top Line (TL) and Bottom Line (BL) for each sentence in paragraph which can be used to measure height of characters. Based on TL and BL, the Base Line (BSL) can be detected using horizontally Most Frequency Black Pixel (MFBP) which is useful to find characters’ segmentation (Atallah and Omar, 2008) . Finding TL, BL and BSL of each sentence help to find characters location in document. Six phases involve in algorithm, each phase has its own functionally. The Algorithm is tested with different input documents and the average accurate rate of detected segmentations is recorded as 96.93%.

Full Text