Abstract

Reproducing printed documents in their existing format is an increasingly important task when existing printed documents must be reprinted or republished with updated content. Text, images, charts, graphs, tables, logos, and signatures are among the prominent components of a printed document. When reproducing the text of a printed document, optical character recognition techniques can address text-related issues such as detection, recognition, and reconstruction. Beyond that, tables can be made editable if they are correctly detected, recognized, and positioned in the document. Researchers who have addressed table-related issues have dealt only with tables that appear in line with the text; little work exists on locating tables when a printed document contains text, tables, and other non-text components together. The few available works treat tables as non-text objects and therefore present no mechanism to separate them from other non-text elements. Treating tables as non-text components makes it impossible to edit and update them in the usual way tables are edited and updated. Our work addresses this issue by separating tables, text, and other non-text objects within a document image using rule-based thresholds and then reconstructing the tables from the extracted features. In experiments on about 480 document images, we achieved 81% automated table detection accuracy when a printed document contains text, table, and non-text components, and about 90% accuracy when the document type information is supplied manually.
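The abstract does not detail how the rule-based thresholds are applied, so the following is only a minimal illustrative sketch of the general idea: extract connected components from a binarized page image and classify each as table, text, or other non-text using hand-set thresholds. The threshold values, the ruling-line heuristic, and the use of OpenCV are assumptions for illustration, not the authors' implementation.

```python
import cv2
import numpy as np

def classify_components(image_path,
                        min_table_area=5000,          # assumed: tables are large regions
                        min_line_rows=3,              # assumed: ruled lines inside a table
                        text_height_range=(8, 40)):   # assumed: typical glyph heights (px)
    """Toy rule-based split of a document image into text, table,
    and other non-text regions. All thresholds are illustrative."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize so that ink becomes foreground (white) for component analysis.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)

    regions = {"text": [], "table": [], "non_text": []}
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        roi = binary[y:y + h, x:x + w]

        # Keep only long horizontal runs as a rough proxy for table ruling lines.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(w // 2, 1), 1))
        horiz = cv2.morphologyEx(roi, cv2.MORPH_OPEN, kernel)
        line_rows = int(np.count_nonzero(horiz.sum(axis=1) > 0))

        if area >= min_table_area and line_rows >= min_line_rows:
            regions["table"].append((x, y, w, h))
        elif text_height_range[0] <= h <= text_height_range[1]:
            regions["text"].append((x, y, w, h))
        else:
            regions["non_text"].append((x, y, w, h))
    return regions
```

In practice such thresholds would need tuning per document class, which is consistent with the reported accuracy gain when the document type is supplied manually.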
