Abstract

With the rise of the data‐driven research paradigm, the number of available datasets is growing exponentially, which makes it challenging for researchers to retrieve relevant datasets efficiently. Previous studies mainly focused on query expansion methods built on sparse retrieval models to improve retrieval accuracy and recall. We investigate the use of semantically rich information to retrieve relevant datasets, and the benefits of domain‐specific dense vector representations over general‐purpose ones. First, we use pairs of semantically related metadata fields to construct domain‐specific, weakly supervised training data. Then, we fine‐tune a pre‐trained transformer‐based deep learning model on this training data using contrastive learning. Finally, we obtain dense vector representations of queries and datasets from the fine‐tuned model, and measure the relevance of a dataset to a query by the similarity between their vectors. To evaluate the proposed model, we collected 104,683 datasets from 13 research data repositories, recruited volunteers to design research‐oriented queries, and annotated the retrieval results. The experimental results show that, compared with the domain‐independent fine‐tuned model, our proposed method improves the NDCG@10 score by about 5%.
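
The retrieval and evaluation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity as the vector similarity measure and a common linear-gain form of NDCG@k, neither of which the abstract specifies. The function names (`cosine`, `rank_datasets`, `ndcg_at_k`), the two-dimensional toy vectors, and the graded relevance labels are all hypothetical. In practice the vectors would come from the fine‐tuned model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_datasets(query_vec, dataset_vecs):
    """Return dataset ids sorted by descending similarity to the query."""
    scores = {ds_id: cosine(query_vec, vec) for ds_id, vec in dataset_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1)), ranks from 1."""
    def dcg(gains):
        # enumerate() starts at 0, so rank i + 1 gives a discount of log2(i + 2)
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(ds_id, 0) for ds_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy data (hypothetical): stand-ins for dense vectors from the fine-tuned model.
dataset_vecs = {"ds_a": [1.0, 0.0], "ds_b": [0.6, 0.8], "ds_c": [0.0, 1.0]}
query_vec = [1.0, 0.1]

# Hypothetical graded relevance labels, as human annotators might assign them.
relevance = {"ds_a": 2, "ds_b": 1, "ds_c": 0}

ranking = rank_datasets(query_vec, dataset_vecs)
score = ndcg_at_k(ranking, relevance, k=10)
print(ranking, round(score, 4))
```

In this toy case the ranking matches the ideal ordering of the relevance labels, so the NDCG@10 score is at its maximum. Reordering the ranked list lowers the score, which is what makes the metric sensitive to where relevant datasets appear in the results.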
