Abstract

Raster-image PDF files originating from scanning or photographing paper documents are inaccessible to both text search engines and screen readers that people with visual impairments use. We here focus on the relatively less-researched problem of converting raster-image files with Arabic script into machine-accessible documents. Our method, called ECDP for "Ensemble-based classification of document patches," segments the physical layout of the document, classifies image patches as containing text or graphics, assembles homogeneous document regions, and passes the text to an optical character recognition engine to convert into natural language. Classification is based on the majority voting of an ensemble of support vector machines. When tested on the dataset BCE-Arabic [Saad et al. in: ACM 9th annual international conference on pervasive technologies related to assistive environments (PETRA'16), Corfu, 2016], ECDP yielded an average patch classification accuracy of 97.3% and average $$F_1$$F1 score of 95.26% for text patches and efficiently extracted text zones in both paragraphs and text-embedded graphics, even if the text is rotated by $$90^{\circ }$$90ź or is in English. ECDP outperforms a classical layout analysis method (RLSA) and a state-of-the-art commercial product (RDI-CleverPage) on this dataset and maintains a relatively high level of performance on document images drawn from two other datasets (Hesham et al. in Pattern Anal Appl 20:1275---1287, 2017; Proprietary Dataset of 109 Arabic Documents. http://www.rdi-eg.com). The results suggest that the proposed method has the potential to generalize well to the analysis of documents with a broad range of content.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.