Abstract
Qatar's Heritage books collection, which forms an invaluable part of the Arabic heritage, holds over 70,000 historical books. Digitizing these books will improve access to the texts while ensuring their preservation. The aim of this project is to explore Optical Character Recognition (OCR) techniques for digitizing historical Arabic texts. Techniques for improving the OCR pipeline were explored in three stages. First, page layout analysis was explored. Next, new Arabic character recognition models were built and their recognition rates were analyzed. Finally, the use of various language models for OCR was analyzed. An important initial step in the OCR pipeline is page layout analysis, which requires identifying and classifying regions of interest in a scanned page. In many historical Arabic texts, scholars have written side notes in the page margins that add significant value to the main content. Techniques were therefore explored to improve the identification of side notes during the digitization of historical Arabic texts. First, text line segmentation was evaluated using two notable open-source OCR packages, OCRopus and Tesseract. Next, geometric layout analysis techniques were explored using OCRopus and MATLAB to identify text line orientation. After layout analysis, the next step in the OCR pipeline is recognizing the words and characters in the text lines segmented during the page layout step. OCRopus was the main open-source OCR software analyzed; it extracts characters directly from the segmented lines. A number of character recognition models were created through extensive training of the OCRopus system. The trained OCRopus models were then tested on historical Arabic text data to calculate character recognition rates.
Additionally, another open-source tool, IMPACT D-TR4.1, was tested to check the accuracy of character clustering in the historical Arabic text. A later stage in OCR, after character recognition, is word boundary identification. In written Arabic, spaces appear between individual words and possibly within a word, which makes word boundary identification difficult. This part of the project assumes character-level OCR output and proceeds from there. For a given stream of characters, word boundaries are identified using the perplexities of a language model (LM) at the character level and at the word level. The character-level language model is explored in two ways: the first approach uses the segment program provided with the SRILM toolkit (Stolcke, 2002); the second maps segmentation to a statistical machine translation (SMT) problem and uses MOSES. The word-level language model is also explored in two ways: the first is a naive approach, in which all possible preceding word boundaries are explored for each word and the one with the highest probability is chosen; the second uses dynamic programming to find the overall placement of boundaries that minimizes cost, i.e., maximizes probability. This work is the result of a project carried out during QCRI's 2013 summer internship program.
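The dynamic-programming idea for word-level boundary placement can be illustrated with a minimal sketch. It is not the project's implementation: the abstract does not specify the language model's order or training data, so the sketch assumes a toy unigram model, and the names `make_unigram_logprob`, `segment`, `TOY_COUNTS`, the `max_len` limit, and the per-character penalty for unknown words are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy word counts; a real system would use a trained language model.
TOY_COUNTS = {"كتاب": 4, "في": 5, "البيت": 3, "ال": 1}


def make_unigram_logprob(counts, unk_logprob_per_char=-10.0):
    """Return a function giving a log-probability for a candidate word.

    Assumption: unknown words are penalized linearly in their length.
    """
    total = sum(counts.values())

    def logprob(word):
        if word in counts:
            return math.log(counts[word] / total)
        return unk_logprob_per_char * len(word)

    return logprob


def segment(chars, word_logprob, max_len=8):
    """Place word boundaries in `chars` so the summed log-probability is maximal."""
    n = len(chars)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix chars[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that prefix
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + word_logprob(chars[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the stream to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(chars[back[i]:i])
        i = back[i]
    return words[::-1]
```

With this toy model, calling `segment` on the three known words concatenated without spaces, using `make_unigram_logprob(TOY_COUNTS)` as the scorer, recovers those three words. Maximizing the summed log-probability is equivalent to minimizing a cost defined as the negative log-probability, which is the sense in which the boundary placement "minimizes cost".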