Abstract

Data quality (DQ) assessment is essential for realizing the promise of big data by judging the value of data in advance. Relevance, an indispensable dimension of DQ, focusing on “fitness for requirement”, can arouse the user’s interest in exploiting the data source. It has two-level evaluations: (1) the amount of data that meets the user’s requirements; (2) the matching degree of these relevant data. However, there lack works of DQ assessment at dimension of relevance, especially for unstructured image data which focus on semantic similarity. When we try to evaluate semantic relevance between an image data source and a query (requirement), there are three challenges: (1) how to extract semantic information with generalization ability for all image data? (2) how to quantify relevance by fusing the quantity of relevant data and the degree of similarity comprehensively? (3) how to improve assessing efficiency of relevance in a big data scenario by design of an effective architecture?To overcome these challenges, we propose a semantic-aware data quality assessment (SDQA) architecture which includes off-line analysis and on-line assessment. In off-line analysis, for an image data source, we first transform all images into hash codes using our improved Deep Self-taught Hashing (IDSTH) algorithm which can extract semantic features with generalization ability, then construct a graph using hash codes and restricted Hamming distance, next use our designed Semantic Hash Ranking (SHR) algorithm to calculate the importance score (rank) for each node (image), which takes both the quantity of relevant images and the degree of semantic similarity into consideration, and finally rank all images in descending order of score. During on-line assessment, we first convert the user’s query into hash codes using IDSTH model, then retrieve matched images to collate their importance scores, and finally help the user determine whether the image data source is fit for his requirement. The results on public dataset and real-world dataset show effectiveness, superiority and on-line efficiency of our SDQA architecture.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.