Identifying Multiple Entity Columns in Web Tables

Ning Wang, Xiangran Ren

doi:10.1142/s0218194018500109

Abstract

Unlike tables in relational databases, web tables have no designated key attributes or entity columns, so it is difficult for computers to understand a table and associate it with a concept in a knowledge taxonomy. Existing techniques for entity column detection can only process tables with a single entity column, discarding tables that describe multiple concepts. In this paper, we propose a framework for identifying multiple entity columns in a web table. First, we annotate column labels for web tables whose labels are missing or noninformative, using the external knowledge base Probase. By detecting concept-attribute relationships between table columns and calculating the credibility of each attribute dependency, we construct a column dependency view for the table. Then, we calculate the column semantic intensity of each column, which depends on its connectivity in the column dependency view and on the credibility of the attribute dependency relationships related to it. We identify all entity columns in the web table by iteratively selecting the column with the highest semantic intensity as the primary entity column and separating the columns that describe the primary concept from the current column dependency view. The results of a comprehensive set of experiments indicate that our entity detection method is more effective than existing methods for both single-concept and multiple-concept tables.
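
The iterative selection step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formula for column semantic intensity, so the sketch assumes it is the sum of the credibilities of a column's outgoing concept-attribute edges among the columns still under consideration. The names `identify_entity_columns`, `semantic_intensity`, and the edge dictionary layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation of a column dependency view:
# (concept_column, attribute_column) -> dependency credibility.
DependencyView = Dict[Tuple[str, str], float]


def semantic_intensity(column: str, edges: DependencyView, remaining: Set[str]) -> float:
    """ASSUMED scoring: sum of credibilities of edges from `column`
    to columns that are still in the view. The paper's actual formula
    is not stated in the abstract."""
    return sum(
        cred
        for (src, dst), cred in edges.items()
        if src == column and dst in remaining
    )


def identify_entity_columns(
    columns: List[str], edges: DependencyView
) -> List[Tuple[str, List[str]]]:
    """Repeatedly pick the column with the highest intensity as the
    primary entity column, then remove it and its attribute columns
    from the view before the next round."""
    remaining = list(columns)
    result: List[Tuple[str, List[str]]] = []
    while remaining:
        live = set(remaining)
        scores = {c: semantic_intensity(c, edges, live) for c in remaining}
        primary = max(remaining, key=lambda c: scores[c])
        if scores[primary] <= 0:
            break  # no remaining column acts as a concept for another
        attributes = [c for c in remaining if (primary, c) in edges]
        result.append((primary, attributes))
        removed = {primary, *attributes}
        remaining = [c for c in remaining if c not in removed]
    return result


# Made-up example table with two concepts (values are illustrative only).
example_columns = ["Country", "Capital", "Population", "Company", "CEO"]
example_edges = {
    ("Country", "Capital"): 0.9,
    ("Country", "Population"): 0.8,
    ("Capital", "Population"): 0.3,
    ("Company", "CEO"): 0.85,
}
entity_columns = identify_entity_columns(example_columns, example_edges)
```

In this made-up example, the first round selects `Country` and separates `Capital` and `Population` from the view; the second round selects `Company` with `CEO` as its attribute. A real implementation would derive the edge credibilities from a knowledge base rather than hard-coding them.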
