Governments are embracing an open data philosophy and making their data freely available to the public to encourage innovation and increase transparency. However, the number of available datasets is still limited. Finding relationships between related datasets on different data portals enables users to search for the relevant datasets. These datasets are generated from the training data, which must be curated according to the user query. However, retrieving relevant datasets is an expensive operation because each dataset requires a preparation procedure, and it consumes a significant amount of space and time. In this study, we propose a novel framework that identifies relationships between datasets using both structural and semantic information in order to find similar datasets. We propose an algorithm that generates the Concept Matrix (CM) and the Dataset Matrix (DM) from the concepts and the datasets; these matrices are then used to curate semantically related datasets in response to user queries. Moreover, we employ compression, indexing, and caching algorithms in the proposed scheme to reduce the storage and time required to retrieve the ranked list of related datasets. Through extensive evaluation, we conclude that the proposed scheme outperforms the existing schemes.