Exploring the use of topological data analysis to automatically detect data quality faults.

M Eduard Tudoreanu

doi:10.3389/fdata.2022.931398

Abstract

Data quality problems may occur in various forms in structured and semi-structured data sources. This paper details an unsupervised method of analyzing data quality that is agnostic to the semantics of the data, the format of the encoding, and the internal structure of the dataset. A distance function is used to transform each record of a dataset into an n-dimensional vector of real numbers, which effectively transforms the original data into a high-dimensional point cloud. The shape of the point cloud is then efficiently examined via topological data analysis to find high-dimensional anomalies that may signal quality issues. The specific quality fault examined in this paper is the presence of records that, while not identical, refer to the same entity. Our algorithm, based on topological data analysis, provides similar accuracy for both higher and lower quality data and performs better than a baseline approach for data with poor quality.
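
The pipeline described in the abstract (records mapped to distance vectors, then a topological pass over the resulting point cloud) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not name the distance function or the topological construction, so the sketch assumes Levenshtein edit distance to a set of landmark records for the embedding, and uses only 0-dimensional persistence (the scales at which connected components merge, equivalent to single-linkage clustering). The names `embed`, `zero_dim_persistence`, `candidate_duplicates`, and the `threshold` parameter are hypothetical.

```python
from itertools import combinations
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(records, landmarks):
    """Assumed embedding: one coordinate per landmark, holding the distance to it."""
    return [[float(edit_distance(r, m)) for m in landmarks] for r in records]


def zero_dim_persistence(points):
    """Return (scale, i, j) for each merge of two connected components.

    Edges are processed in order of increasing Euclidean length; a union-find
    structure records when two components first join. The scales are the death
    times of 0-dimensional features in a Vietoris-Rips filtration.
    """
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append((d, i, j))
    return merges


def candidate_duplicates(records, landmarks, threshold):
    """Flag record pairs whose components merge at or below a hypothetical threshold."""
    points = embed(records, landmarks)
    return [
        (records[i], records[j])
        for d, i, j in zero_dim_persistence(points)
        if d <= threshold
    ]


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        "John Smith, 12 Oak St",
        "Jon Smith, 12 Oak St.",
        "Maria Garcia, 4 Elm Rd",
        "Wei Chen, 77 Pine Ave",
    ]
    for pair in candidate_duplicates(records, landmarks=records, threshold=5.0):
        print(pair)
```

In this sketch, components that merge at a very small scale are treated as candidate records referring to the same entity. Using every record as a landmark is only practical for small datasets; a larger dataset would need a sampled landmark set, and the threshold would have to be chosen from the observed merge scales rather than fixed in advance.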

Journal: Frontiers in Big Data
Publication Date: Dec 5, 2022
License type: CC BY 4.0

Similar Papers

Topographical data analysis to identify high-density clusters in stroke patients undergoing post-acute rehabilitation
Eliezer Bose ... Qing Mei Wang
Topics in Stroke Rehabilitation | VOL. 28
29 Oct 2020

A New Approach to Investigate the Association between Brain Functional Connectivity and Disease Characteristics of Attention-Deficit/Hyperactivity Disorder: Topological Neuroimaging Data Analysis.
Sunghyon Kyeong ... Jae-Jin Kim
PLOS ONE | VOL. 10
09 Sep 2015

Applied Computational Topology for Point Clouds and Sparse Timeseries Data
01 Jan 2017

Minimizing the data quality problem of information systems: A process-based method
Qi Liu ... Wenlong Wang
Decision Support Systems | VOL. 137
08 Aug 2020
