The Egyptian Press Archive at the Centre d'Études et de Documentation Économiques, Juridiques et Sociales: A Challenging Case Study in Arabic Optical Character Recognition

Rami Khalil Rouchdi, Hala Bayoumi, Mohamed Ahmed Ellotf

doi:10.4000/ema.13156

Abstract

This paper evaluates three commercial Arabic Optical Character Recognition (OCR) systems for digitizing press archives with degraded text quality: Sakhr Automatic Reader (AR) version 11.2 Gold, ABBYY FineReader (FR) version 12, and NovoVerus (NV) version 4.2.0. Unlike similar evaluations, we built our own dataset to identify the image specifications and tools that yield the highest accuracy for the Egyptian press archive project of the Centre d'Études et de Documentation Économiques, Juridiques et Sociales (CEDEJ). We describe how the dataset was developed and how different image specifications affect OCR accuracy. The dataset consists of 30 press clips of varying quality in terms of image background, text size, and other effects of age and storage. Each clip was scanned at different resolutions and in different color modes, producing 180 samples (six versions of each clip), which were then fed to the OCR suites to evaluate their recognition accuracy. We then applied the procedure that produced the highest consistent OCR accuracy to the corpus of the CEDEJ project, more than one million press clips, and evaluated the results. The paper's main contribution is an approach to digitizing and running OCR on documents with low-quality Arabic text that yields high and consistent accuracy. The approach is based on evaluating the performance of OCR suites against different image capture and manipulation specifications.
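
The selection step described above, choosing the scan specification with the highest consistent accuracy, can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not state the accuracy metric, the resolutions, or the color modes, so the character-level accuracy based on edit distance, the ranking rule (highest mean, then lowest spread), and the configuration labels are all assumptions made for the example.

```python
from statistics import mean, pstdev


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Assumed metric: 1 - (edit distance / ground-truth length), floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def rank_configurations(results):
    """Rank scan configurations by mean accuracy, breaking ties by lower spread.

    `results` maps a hypothetical configuration label, such as
    (resolution, color_mode), to the list of per-clip accuracies.
    """
    summary = {cfg: (mean(acc), pstdev(acc)) for cfg, acc in results.items()}
    return sorted(summary.items(), key=lambda kv: (-kv[1][0], kv[1][1]))


if __name__ == "__main__":
    # Hypothetical accuracies for two configurations; not data from the paper.
    demo = {
        ("config A", "grayscale"): [0.5, 1.0],
        ("config B", "color"): [0.75, 0.75],
    }
    for cfg, (avg, spread) in rank_configurations(demo):
        print(cfg, round(avg, 3), round(spread, 3))
```

Ranking on the mean alone would treat both demo configurations as equal; adding the spread as a tie-breaker is one simple way to express "consistent" accuracy, though other criteria (for example the minimum per-clip accuracy) would be equally defensible.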

Full Text