Abstract

The Data on the Web Best Practices Working Group, as part of the W3C Data Activity, is standardizing the Data Quality Vocabulary (DQV) for expressing the quality of datasets published on the Web. By exploiting DQV-based quality metadata associated with the datasets in a data portal, data consumers can filter and rank the portal's conventional search results by data quality and so obtain the desired datasets with high data quality. Despite significant progress in standardization, there is a lack of systematic research on approaches and tools for data quality-based filtering and ranking of datasets published on the Web. This paper therefore proposes a generic software framework for Data Quality-based Filtering and Ranking of Datasets (DQFIRD) in data portals. DQFIRD adopts faceted search (or faceted exploration) techniques to filter the search results of a data portal based on quality metadata about the resulting datasets, and then ranks the filtered datasets according to the numeric values of the quality measurements in the metadata. We designed the main algorithms of DQFIRD and implemented a prototype using Java and the Jena API. Furthermore, we used the prototype to conduct case-study experiments and a time-efficiency test on the Faceted Taxonomy Materialization (FTM) algorithm, the most time-consuming online operation in DQFIRD. The results indicate that the proposed DQFIRD approach is implementable and effective, and that it has low time complexity: the run time of the FTM algorithm grows approximately linearly as the size of the relevant dataset quality metadata increases.
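
The filter-then-rank idea described above can be illustrated with a minimal sketch. It is not the authors' Java and Jena prototype and it does not implement the FTM algorithm; it assumes, purely for illustration, that each search result has already been reduced to an in-memory map from a metric name to the numeric value of one quality measurement. The names `Dataset`, `facet_counts`, `filter_and_rank`, and the metric names and values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One search result plus its quality metadata (hypothetical structure)."""
    title: str
    # metric name -> numeric value of one quality measurement
    quality: dict = field(default_factory=dict)


def facet_counts(results):
    """Count how many search results carry a measurement for each metric.

    These counts are what a faceted interface would show next to each facet.
    """
    counts = {}
    for ds in results:
        for metric in ds.quality:
            counts[metric] = counts.get(metric, 0) + 1
    return counts


def filter_and_rank(results, metric, minimum=0.0):
    """Keep results measured on `metric` with value >= minimum, best first."""
    kept = [
        ds for ds in results
        if metric in ds.quality and ds.quality[metric] >= minimum
    ]
    return sorted(kept, key=lambda ds: ds.quality[metric], reverse=True)


# Made-up search results and measurement values, for illustration only.
results = [
    Dataset("bus-stops", {"completeness": 0.75}),
    Dataset("air-quality", {"completeness": 0.92, "availability": 0.80}),
    Dataset("budget-2015", {"availability": 0.99}),
]

print(facet_counts(results))
ranked = filter_and_rank(results, "completeness", minimum=0.7)
print([ds.title for ds in ranked])
```

In this sketch the facets are simply the metric names. A fuller treatment would presumably organize them into a taxonomy (for example, by quality dimension) and read the measurements from DQV metadata rather than from hand-written dictionaries.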
