Data-driven domain discovery for structured datasets

Masayo Ota, Juliana Freire, Divesh Srivastava, Heiko Müller

doi:10.14778/3384345.3384346

Abstract

The growing number of open datasets has created new opportunities to derive insights and address important societal problems. These data, however, often come with little or no metadata, in particular about the types of their attributes, which greatly limits their utility. In this paper, we address the problem of domain discovery: given a collection of tables, we aim to identify sets of terms that represent instances of a semantic concept or domain. Knowledge of attribute domains not only enables a richer set of queries over dataset collections, but can also help in data integration. We propose a data-driven approach that leverages value co-occurrence information across a large number of dataset columns to derive robust context signatures and infer domains. We discuss the results of a detailed experimental evaluation using real urban dataset collections. The results show that our approach is robust, outperforms state-of-the-art methods in the presence of incomplete columns and heterogeneous or erroneous data, and scales to datasets with several million distinct terms.

Full Text