Learning Domain‐specific Semantic Representation from Weakly Supervised Data to Improve Research Dataset Retrieval

Pengcheng Luo, Zheng Gao, Xin Guo, Shiqi Wang, Jimin Wang, Sang Wouk Cho, Lingzi Hong

doi:10.1002/pra2.616

Abstract

With the rise of the data‐driven research paradigm, the number of available datasets is growing exponentially, which makes it difficult for researchers to retrieve relevant datasets efficiently. Previous studies mainly focused on query expansion methods based on sparse retrieval models to improve the accuracy and recall of retrieval. We investigated the use of semantically rich information to retrieve relevant datasets, and the benefits of domain‐specific dense vector representations over general‐purpose ones. First, we used pairs of semantically related metadata fields to construct domain‐specific, weakly supervised training data. Then, we fine‐tuned a pre‐trained transformer‐based deep learning model on this training data using contrastive learning. Finally, we obtained dense vector representations of the queries and datasets from the fine‐tuned model, and measured the relevance of a dataset to a query by the similarity between their vectors. To evaluate the proposed model, we collected 104,683 datasets from 13 research data repositories, recruited volunteers to design research‐oriented queries, and annotated the retrieval results. The experimental results show that, compared with the domain‐independent fine‐tuned model, our proposed method improves the NDCG@10 score by about 5%.
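The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the contrastive loss, the similarity measure, or the exact NDCG formulation, so everything below is an assumption: an InfoNCE-style loss with in-batch negatives, cosine similarity, and NDCG with linear gains computed over the judged results only. The function names, the temperature value, and the toy vectors are hypothetical and stand in for the output of a fine-tuned encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss (assumed, not confirmed by the abstract).

    positives[i] is the weakly supervised match for anchors[i], e.g. two
    semantically related metadata fields of the same dataset; every other
    positive in the batch is treated as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def rank_datasets(query_vec, dataset_vecs):
    """Return dataset ids sorted by decreasing cosine similarity to the query."""
    scores = {name: cosine(query_vec, vec) for name, vec in dataset_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def ndcg_at_k(ranked_gains, k=10):
    """NDCG@k with linear gains; the ideal ranking uses the same judged items."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Toy vectors standing in for encoder output (hypothetical data).
toy_datasets = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_ranking = rank_datasets([1.0, 0.0], toy_datasets)
```

In this sketch, `in_batch_contrastive_loss` is small when each anchor is closest to its own positive and large otherwise, which is the signal that fine-tuning would minimize. `ndcg_at_k` takes the annotated relevance grades of the returned datasets in ranked order and returns a score between 0 and 1.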

Journal: Proceedings of the Association for Information Science and Technology
Publication Date: Oct 1, 2022
Citations: 3
