Determining the Similarity of Research Data by Using an Interoperable Metadata Extraction Method

Benedikt Heinrichs, M Amin Yazdi

doi:10.52825/cordi.v1i.290

Open Access

Abstract

Determining the similarity of research data is not a simple task, as formats can differ widely depending on the domain. Because many formats are stored as binary files, a raw comparison of the files does not yield good results, which makes it hard to tell accurately how similar two pieces of research work are by comparing their data. The emergence of extracted interoperable metadata provides a way to describe data that is independent of the data format. This work therefore uses extracted interoperable metadata to create a method that determines the similarity of research data based on their metadata. The method utilizes domain knowledge about the extracted metadata and the way they are formulated. A baseline and several further methods are created for comparison. The results show that our method outperforms all other methods evaluated, especially those that compare the research data itself rather than the metadata. Since the results are promising, we propose further investigations against other datasets and possible use cases.
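
The contrast between comparing raw data and comparing metadata can be illustrated with a minimal sketch. It is not the method proposed in this work: it assumes that extracted metadata can be represented as a set of subject-predicate-object triples and uses plain Jaccard similarity as a stand-in measure. The triples, the raw byte strings, and the helper names `jaccard` and `byte_similarity` are hypothetical and serve only as an illustration.

```python
from typing import Set, Tuple

# A metadata statement as a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]


def jaccard(a: Set[Triple], b: Set[Triple]) -> float:
    """Jaccard similarity: shared statements divided by all distinct statements."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def byte_similarity(x: bytes, y: bytes) -> float:
    """Naive raw comparison: share of positions holding the same byte,
    measured against the longer of the two inputs."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    same = sum(1 for p, q in zip(x, y) if p == q)
    return same / longest


# Hypothetical example: the same table stored once as CSV and once in a
# binary container. The bytes differ almost everywhere, while the
# extracted metadata differs only in the format statement.
raw_a = b"a,b,c\n1,2,3\n"
raw_b = b"\x89HDF\r\n\x1a\n\x00\x00\x00\x00"

meta_a: Set[Triple] = {
    ("data", "dcterms:creator", "Alice"),
    ("data", "dcterms:format", "text/csv"),
    ("data", "ex:columns", "3"),
}
meta_b: Set[Triple] = {
    ("data", "dcterms:creator", "Alice"),
    ("data", "dcterms:format", "application/x-hdf5"),
    ("data", "ex:columns", "3"),
}

byte_score = byte_similarity(raw_a, raw_b)
metadata_score = jaccard(meta_a, meta_b)

print(f"raw byte similarity: {byte_score:.2f}")
print(f"metadata similarity: {metadata_score:.2f}")
```

In this toy case the metadata comparison rates the two files as considerably more similar than the byte comparison does, which mirrors the motivation given above. A measure that exploits domain knowledge about how the metadata are formulated would go beyond this simple set overlap.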

Full Text