Abstract

Relational tables collected from HTML pages (web tables) are used for a variety of tasks including table extension, knowledge base completion, and data transformation. Most of the existing algorithms for these tasks assume that the data in the tables has the form of binary relations, i.e., relates a single entity to a value or to another entity. Our exploration of a large public corpus of web tables, however, shows that web tables contain a large fraction of non-binary relations which will likely be misinterpreted by the state-of-the-art algorithms. In this paper, we propose a categorisation scheme for web table columns which distinguishes the different types of relations that appear in tables on the Web and may help to design algorithms which better deal with these different types. Designing an automated classifier that can distinguish between different types of relations is non-trivial, because web tables are relatively small, contain a high level of noise, and often miss partial key values. In order to be able to perform this distinction, we propose a set of features which goes beyond probabilistic functional dependencies by using the union of multiple tables from the same web site and from different web sites to overcome the problem that single web tables are too small for the reliable calculation of functional dependencies.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.