Abstract

The literature on coronaviruses counts more than 300,000 publications. Finding the papers relevant to arbitrary queries is essential to discovering useful knowledge. Current state-of-the-art information retrieval (IR) systems use deep learning approaches and need supervised training sets with labeled data, i.e., they must know a priori the queries and their corresponding relevant papers. Creating such labeled datasets is time-expensive and requires prominent experts' effort, resources insufficiently available under the time pressure of a pandemic. We present a new self-supervised solution, called SUBLIMER, that does not require labels to learn to search corpora of scientific papers for the documents most relevant to arbitrary queries. SUBLIMER is a novel, efficient IR engine trained on the unsupervised COVID-19 Open Research Dataset (CORD19) using deep metric learning. The core point of our self-supervised approach is that it uses no labels but exploits the bibliographic citations among papers to create a latent space where spatial proximity is a metric of semantic similarity; for this reason, it can also be applied to paper corpora in other domains. SUBLIMER, despite being self-supervised, outperforms the Precision@5 (P@5) and Bpref of the state-of-the-art competitors on CORD19, which, differently from our approach, require both labeled datasets and a number of trainable parameters that is an order of magnitude higher than ours.
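The citation-as-supervision idea can be illustrated with a toy sketch: citation links stand in for relevance labels, so a citing/cited pair is treated as a positive pair and an unlinked paper as a negative, and a standard triplet margin loss shapes the latent space. This is a minimal illustration of deep metric learning with citation-derived triplets, not the actual SUBLIMER architecture; the embeddings and the `triplet_loss` helper below are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: pull the anchor toward the positive (a cited paper)
    and push it away from the negative (an unlinked paper) by a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: paper A cites paper B, so the pair (A, B) is
# assumed semantically close; paper C has no citation link to A.
paper_a = np.array([0.9, 0.1])
paper_b = np.array([0.8, 0.2])  # positive: linked by a citation
paper_c = np.array([0.1, 0.9])  # negative: no citation link

loss = triplet_loss(paper_a, paper_b, paper_c)
```

During training, gradients of this loss would move the positive pair together and the negative apart, which is what makes spatial proximity a usable metric of semantic similarity at query time.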

Highlights

  • We performed a series of tests to formally evaluate the entire system and its components, such as the language model

  • The goal was to analyze the different configurations of SUBLIMER against CO-Search and COVIDEX, the state of the art in CORD19 information retrieval

  • We tested our solution against the first round of the Text Retrieval Conference (TREC)-COVID test set, comparing our results with the state-of-the-art in this domain: CO-Search and COVIDEX


Introduction

Information retrieval systems play a central role in this situation because they can find, in a vast collection, the documents semantically related to a human query. Such systems are built by leveraging neural models, but training these models is far from trivial because they require a collection of papers pre-classified as relevant for a given set of queries or topics. For this reason, labeled datasets, where the relationships between documents and topics are known in advance, are fundamental. Only a few domains have labeled data, and their preparation is often infeasible due to time constraints, the economic resources required, and the human experts' effort involved.
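The retrieval step described above reduces, at inference time, to embedding the query into the same latent space as the documents and ranking by a similarity measure. A minimal sketch, assuming documents and queries are already embedded as vectors (the toy vectors and the `rank` helper below are illustrative, not part of the published system):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: -scores[i])

# Toy 2-D embeddings for three documents and one query.
docs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
query = np.array([0.9, 0.1])
ranking = rank(query, docs)  # doc 0 ranks first, being nearest to the query
```

In a trained metric space like the one the paper describes, nearest neighbors under such a similarity are exactly the semantically related documents, so no query-document relevance labels are needed at search time.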

