Profiling the semantics of n-ary web table data

Oliver Lehmberg, Christian Bizer

doi:10.1145/3323878.3325806

Abstract

The Web contains millions of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for these tasks build upon the assumption that web table data consists of binary relations, meaning that an attribute value depends on a single key attribute, and that the key attribute value is contained in the HTML table. Inspecting randomly chosen tables on the Web, however, quickly reveals that both assumptions are wrong for a large fraction of the tables. In order to better understand the potential of non-binary web table data for downstream applications, this paper analyses a corpus of 5 million web tables originating from 80 thousand different web sites with respect to how many web table attributes are non-binary, what composite keys are required to correctly interpret the semantics of the non-binary attributes, and whether the values of these keys are found in the table itself or need to be extracted from the page surrounding the table. The profiling of the corpus shows that at least 38% of the relations are non-binary. Recognizing these relations requires information from the title or the URL of the web page in 50% of the cases. We find that different websites use keys of varying length for the same dependent attribute, e.g., one cluster of websites presents employment numbers depending on time, while another cluster presents them depending on time and profession. By identifying these clusters, we lay the foundation for selecting Web data sources according to the specificity of the keys that are used to determine specific attributes.
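To make the distinction between binary and non-binary relations concrete, the following minimal Python sketch checks whether a candidate key determines a dependent attribute in a small table. It is an illustration only, not the profiling method used in the paper: the helper name `determines`, the column names, and all values in `rows` are made up for this example, loosely following the employment-numbers case mentioned in the abstract.

```python
from typing import Any, Mapping, Sequence


def determines(rows: Sequence[Mapping[str, Any]],
               key: Sequence[str],
               attr: str) -> bool:
    """Return True if every combination of `key` values maps to exactly
    one value of `attr`, i.e. the functional dependency key -> attr holds
    on the given rows. (Hypothetical helper, for illustration only.)"""
    seen = {}
    for row in rows:
        k = tuple(row[c] for c in key)
        if k in seen and seen[k] != row[attr]:
            return False  # same key values, different dependent value
        seen[k] = row[attr]
    return True


# Invented example data: employment numbers by year and profession.
rows = [
    {"year": 2016, "profession": "nurse", "employed": 1200},
    {"year": 2016, "profession": "teacher", "employed": 3400},
    {"year": 2017, "profession": "nurse", "employed": 1250},
    {"year": 2017, "profession": "teacher", "employed": 3380},
]

# Binary interpretation: a single key attribute determines the value.
binary_ok = determines(rows, ["year"], "employed")

# Non-binary interpretation: a composite key is required.
nary_ok = determines(rows, ["year", "profession"], "employed")

print(binary_ok, nary_ok)
```

On this toy table, `year` alone does not determine `employed`, while the composite key (`year`, `profession`) does. If one part of the key, such as the year, appeared only in the page title or URL, it would first have to be added to each row as an extra column before such a check could succeed.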

Full Text