QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION

A B Nugumanova, M Mansurova, Y M Baiburin, K S Apayev, A G Ospan

doi:10.26577/jmmcs.2022.v114.i2.08


Open Access

https://doi.org/10.26577/jmmcs.2022.v114.i2.08


Abstract

In this paper, we propose a pipeline for automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is an actively developing area of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools focus on a single specific task, for example, only on recognizing tables presented as images. We propose combining these tasks into a single pipeline that supports the full cycle of table processing, from search, recognition and extraction through to semantic analysis and interpretation, i.e. table understanding. Understanding tables and populating knowledge bases (knowledge graphs) with the meaningful information they contain is the ultimate goal of our design. The first part of the work presents methods for detecting tables in web pages and PDF documents, as well as for automatically detecting the attributes and values of objects. The second part presents the architecture and structure of the Qurma tool. The results demonstrate an implementation of the parser for the Almaty-Ust-Kamenogorsk flight search topic.
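As a rough illustration of the first stages described in the abstract (finding tables in an HTML page, then reading attributes and values from them), the following minimal sketch uses only the Python standard library. It is not the Qurma implementation. The names `TableExtractor`, `table_to_records` and `extract_records` are hypothetical, and the sketch assumes simple, non-nested tables whose first row holds the attribute names.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper: collects each <table> as a list of rows of cell text.

    Assumes tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table currently being parsed
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def table_to_records(rows):
    """Assumption: the first row names the attributes, later rows hold values."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def extract_records(html):
    """Return one list of attribute/value records per non-empty table found."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table_to_records(t) for t in parser.tables if t]
```

Calling `extract_records(page_source)` on the text of a page returns one list of dictionaries per table, which is a convenient intermediate form before any semantic interpretation. Real pages often break the first-row-is-header assumption (merged cells, layout tables, multi-row headers), and PDF or image sources would need different detection methods altogether.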
