Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data

Yu Liu, Yanzhao Xie, Zhili Xiao, Chan Guo, Yangtao Wang, Lianli Gao

doi:10.1145/3420038

Abstract

Data mining routinely faces a problem it can hardly solve on its own: a dataset may contain little information that is meaningful for a given requirement. When multiple unknown datasets are available, allocating data mining resources so as to acquire more of the desired data requires a data quality assessment framework based on the relevance between each dataset and the requirement. Such a framework helps the user judge the potential benefit of each candidate dataset in advance and allocate resources accordingly. However, unstructured data (e.g., image data) often exists as dark data, which makes it difficult for the user to understand in real time how relevant the content of a dataset is. Even if all data have label descriptions, efficiently measuring the relevance between data under semantic propagation remains an urgent problem. To address this, we propose a Deep Hash-based Relevance-aware Data Quality Assessment framework, which contains off-line learning and relevance mining parts as well as an on-line assessing part. In the off-line part, we first design a Graph Convolution Network (GCN)-AutoEncoder hash (GAH) algorithm to recognize the data (i.e., lighten the dark data), then construct a graph with restricted Hamming distance, and finally design a Cluster PageRank (CPR) algorithm to calculate an importance score for each node (image), so as to obtain a relevance representation based on semantic propagation. In the on-line part, we first retrieve the importance score by hash code and then quickly obtain the assessment conclusion from the importance list. On the one hand, introducing GCN and co-occurrence probability into GAH improves its ability to perceive dark data. On the other hand, CPR exploits hash collisions to reduce the scale of the graph and of the iteration matrix, which greatly reduces the consumption of space and computing resources.
We conduct extensive experiments on both single-label and multi-label datasets to assess the relevance between data and requirements and to test resource allocation. Experimental results show that our framework obtains the most desired data for the same mining resources. In addition, test results on the Tencent1M dataset demonstrate that the framework completes the assessment stably across different requirements.
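The off-line graph construction and on-line lookup described in the abstract can be illustrated with a small sketch. It is not the authors' GAH or CPR implementation: the hash codes are hard-coded integers rather than learned by a network, and the ranking step is an ordinary size-weighted PageRank over collapsed codes, chosen here only as a stand-in. All function names (`hamming`, `build_cluster_graph`, `cluster_pagerank`, `lookup_importance`) and the parameters `radius`, `damping`, and `iters` are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def build_cluster_graph(codes, radius):
    """Collapse identical codes into one node, then link nodes whose
    Hamming distance is at most `radius` (assumed reading of
    'restricted Hamming distance')."""
    sizes = Counter(codes)  # node -> number of images sharing that code
    nodes = sorted(sizes)
    edges = {u: set() for u in nodes}
    for u, v in combinations(nodes, 2):
        if hamming(u, v) <= radius:
            edges[u].add(v)
            edges[v].add(u)
    return sizes, edges


def cluster_pagerank(sizes, edges, damping=0.85, iters=100):
    """Power iteration in which teleport and edge weights are
    proportional to cluster size (an illustrative choice, not CPR itself)."""
    total = sum(sizes.values())
    teleport = {u: sizes[u] / total for u in sizes}
    score = dict(teleport)
    for _ in range(iters):
        nxt = {u: (1 - damping) * teleport[u] for u in sizes}
        for u in sizes:
            nbrs = edges[u]
            if not nbrs:
                # Isolated node: spread its mass by the teleport distribution.
                for v in sizes:
                    nxt[v] += damping * score[u] * teleport[v]
                continue
            weight = sum(sizes[v] for v in nbrs)
            for v in nbrs:
                nxt[v] += damping * score[u] * sizes[v] / weight
        score = nxt
    return score


def lookup_importance(code, scores):
    """On-line step: fetch the precomputed score for a query hash code."""
    return scores.get(code, 0.0)


# Five hypothetical 4-bit codes; two images collide on 0b0000.
codes = [0b0000, 0b0000, 0b0001, 0b0011, 0b1111]
sizes, edges = build_cluster_graph(codes, radius=1)
scores = cluster_pagerank(sizes, edges)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Because colliding codes share a node, the iteration runs over the number of distinct codes rather than the number of images, which is the kind of saving the abstract attributes to hash collisions.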

Journal: ACM/IMS Transactions on Data Science
Publication Date: Apr 8, 2021
Citations: 2

