When Considering More Elements: Attribute Correlation in Unsupervised Data Cleaning under Blocking

Pei Li, Chaofan Dai, Wenqian Wang

doi:10.3390/sym11040575

Abstract

In banks, governments, and internet companies, the growing demand for data across information systems and the continual shortening of data collection and update cycles can give rise to a variety of data quality issues in a database. As data volumes expand, methods such as pre-specifying business rules or introducing expert experience into the repair process are no longer applicable to information systems that require rapid responses. In this paper, we divide data cleaning into supervised and unsupervised forms according to whether there are interventions in the repair process, and put forward a new dimension suitable for unsupervised cleaning. For weak logic errors in unsupervised data cleaning, we propose an attribute correlation-based (ACB)-Framework under blocking and design three different data blocking methods to reduce the time complexity and to test the impact of clustering accuracy on data cleaning. The experiments show that the blocking methods can effectively reduce the repair time while maintaining repair validity. Moreover, we conclude that blocking methods with excessively high clustering accuracy tend to put tuples with the same elements into one data block, which reduces the cleaning ability. In summary, the ACB-Framework with blocking reduces the time cost and needs neither the guidance of domain knowledge nor interventions in repair, so it can be applied in information systems requiring rapid responses, such as internet web pages, network servers, and sensor information acquisition.
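To illustrate why blocking lowers repair time, the following minimal sketch counts candidate tuple pairs with and without blocking. It is not the paper's ACB-Framework or any of its three blocking methods: the blocking key (the city column), the sample records, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(tuples, key):
    """Group tuples into blocks using a (hypothetical) blocking key function."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[key(t)].append(t)
    return dict(blocks)


def candidate_pairs(tuples):
    """All pairs of tuples: quadratic in the number of tuples."""
    return list(combinations(tuples, 2))


def blocked_candidate_pairs(tuples, key):
    """Only pairs of tuples that fall into the same block."""
    pairs = []
    for block in block_by_key(tuples, key).values():
        pairs.extend(combinations(block, 2))
    return pairs


# Illustrative records (name, city, zip); values are made up.
records = [
    ("Alice", "Boston", "02101"),
    ("Alise", "Boston", "02101"),
    ("Bob", "Chicago", "60601"),
    ("Bobb", "Chicago", "60601"),
    ("Carol", "Denver", "80201"),
    ("Carole", "Denver", "80201"),
]

all_pairs = candidate_pairs(records)
blocked = blocked_candidate_pairs(records, key=lambda t: t[1])
print(len(all_pairs), len(blocked))
```

With six records in three equal blocks, the number of pairs to inspect falls from 15 to 3. The trade-off stated in the abstract still applies: a block that contains only near-identical tuples leaves little contrasting evidence for repair.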

Highlights

Data cleaning means the examination and repair of identifiable errors by manual or technical means to improve data quality [1]
The regression-based method (RBM) repairs data by multiple regression: it builds a multiple regression model between the erroneous attribute and the other attributes in the dataset to obtain the target repair value, with text attributes compared by edit distance and numerical attributes by Euclidean distance
We believe that the blocking methods can significantly reduce the original repair time, but the repair ability will be reduced to a certain extent
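The two distance measures named in the highlights can be sketched as follows. This is only an illustration of edit (Levenshtein) distance for text attributes and Euclidean distance for numerical attributes, not the paper's RBM implementation; the function names are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def euclidean_distance(x, y) -> float:
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(edit_distance("kitten", "sitting"))
print(euclidean_distance((0, 0), (3, 4)))
```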

Summary

Introduction

Data cleaning means the examination and repair of identifiable errors by manual or technical means to improve data quality [1]. By analogy with supervised learning [4,5] and unsupervised learning [6,7,8] in machine learning, we divide data cleaning into two forms: supervised and unsupervised data cleaning.

Journal: Symmetry	Publication Date: Apr 19, 2019
Citations: 1	License type: CC BY 4.0
