Towards a Hybrid Imputation Approach Using Web Tables

Ahmad Ahmadov, Robert Wrembel, Maik Thiele, Julian Eberius, Wolfgang Lehner

doi:10.1109/bdc.2015.38

Abstract

Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further aggravated. While the process of filling in missing values is traditionally addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on that, chooses the best imputation approach, i.e., a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus of 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
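
The combination of lookup and statistical imputation described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the function names (fit_line, hybrid_impute), the column names, the toy values, and the simple "try the lookup first, then fall back to regression" rule are all assumptions made for illustration. The paper's strategy selects an approach based on dataset characteristics, which this sketch does not attempt to model, and the dictionary below merely stands in for values retrieved from matched Web tables.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All predictor values are identical: fall back to the mean of y.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def hybrid_impute(rows, key, predictor, target, lookup):
    """Fill missing `target` values (hypothetical fallback rule).

    1. If the row's key is found in `lookup` (a stand-in for an external
       source such as a matched Web table), use the looked-up value.
    2. Otherwise, predict the value with a regression line fitted on the
       complete rows.
    Returns new row dicts; the input rows are left unchanged.
    """
    complete = [
        r for r in rows
        if r[target] is not None and r[predictor] is not None
    ]
    model = None
    if len(complete) >= 2:
        model = fit_line(
            [r[predictor] for r in complete],
            [r[target] for r in complete],
        )

    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        if row[target] is None:
            if row[key] in lookup:
                row[target] = lookup[row[key]]
                row["source"] = "lookup"
            elif model is not None and row[predictor] is not None:
                a, b = model
                row[target] = a + b * row[predictor]
                row["source"] = "regression"
        result.append(row)
    return result


# Toy data (made up for illustration): y is missing for rows D and E.
rows = [
    {"id": "A", "x": 1.0, "y": 10.0},
    {"id": "B", "x": 2.0, "y": 20.0},
    {"id": "C", "x": 3.0, "y": 30.0},
    {"id": "D", "x": 4.0, "y": None},
    {"id": "E", "x": 5.0, "y": None},
]
external = {"D": 41.0}  # stand-in for a value found in an external table

imputed = hybrid_impute(rows, key="id", predictor="x", target="y", lookup=external)
```

In this toy input, row D is filled from the external dictionary and row E from the fitted line; the added "source" field records which path produced each value, which is one way to keep looked-up and estimated values distinguishable downstream.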
