Abstract

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. Throughout 2020, over 400,000 coronavirus-related publications were collected through the COVID-19 Open Research Dataset. Here, we present CO-Search, a semantic, multi-stage search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns a sorted list of the 1000 most relevant documents, and a re-ranker, which further orders them by relevance. The retriever is composed of a deep learning model (Siamese-BERT) that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most important words of a query. The re-ranker assigns a relevance score to each document, computed from the outputs of (1) a question-answering module that gauges how well each document answers the query, and (2) an abstractive summarization module that determines how well a query matches a generated summary of the document. To account for the relatively limited dataset, we develop a text augmentation technique that splits the documents into pairs of paragraphs and the citations contained in them, creating millions of (citation title, paragraph) tuples for training the retriever. We evaluate our system (http://einstein.ai/covid) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across multiple key information retrieval metrics.
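The text augmentation step described above can be illustrated with a minimal sketch. It assumes documents are plain text with blank-line paragraph breaks and numeric inline citations such as [12], and that a reference-number-to-title mapping is available; the function name `citation_paragraph_pairs`, the regular expression, and the input format are hypothetical choices for illustration, not details taken from the paper.

```python
import re
from typing import Dict, List, Tuple

# Assumed citation format: numeric markers such as [12] or [3, 7].
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def citation_paragraph_pairs(
    document: str, references: Dict[int, str]
) -> List[Tuple[str, str]]:
    """Build (citation title, paragraph) tuples from one document.

    `references` maps a reference number to the cited paper's title.
    Citations whose number is missing from `references` are skipped.
    """
    pairs: List[Tuple[str, str]] = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", document):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        seen = set()  # emit each cited title at most once per paragraph
        for match in CITATION.finditer(paragraph):
            for ref_id in (int(n) for n in match.group(1).split(",")):
                title = references.get(ref_id)
                if title is not None and ref_id not in seen:
                    seen.add(ref_id)
                    pairs.append((title, paragraph))
    return pairs
```

Each resulting tuple pairs a short, query-like text (the cited title) with a longer passage that cites it, which is the kind of pairing a Siamese encoder could plausibly be trained on.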

Highlights

  • The evolution of the SARS-CoV-2 virus, with its unique balance of virulence and contagiousness, has resulted in the COVID-19 pandemic

  • We evaluate CO-Search on data from the Text Retrieval Conference (TREC)-COVID challenge[10], a five-round information retrieval (IR) competition for COVID-19 search engines, using several standard IR metrics: normalized discounted cumulative gain, precision with N

  • The evaluation dataset consists of topics, CORD-19 corpus is CovidQA, which includes a small number of questions from the CORD-19 tasks[12]
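As a reference point for the first metric named in the highlights, here is a minimal sketch of normalized discounted cumulative gain (nDCG) at a cutoff k. It uses the linear-gain form with a log2 rank discount, which is one common definition; the exact variant and cutoff used in the TREC-COVID evaluation are not stated here, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(ranked: Sequence[float], judged: Sequence[float], k: int) -> float:
    """nDCG@k for one query.

    `ranked` holds the relevance labels of the returned documents, in the
    order the system ranked them. `judged` holds the labels of all judged
    documents for the query and defines the ideal ordering.
    """
    ideal = dcg(sorted(judged, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the most relevant documents first scores 1.0, and the score falls as relevant documents are pushed further down the list.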


Summary

INTRODUCTION

The evolution of the SARS-CoV-2 virus, with its unique balance of virulence and contagiousness, has resulted in the COVID-19 pandemic. CO-Search indexes content from over 400,000 scientific papers made available through the COVID-19 Open Research Dataset Challenge (CORD-19)[9], an initiative put forth by the US White House and other prominent institutions in early 2020. The goal of this line of work is to offer an alternative, scientific search engine designed to limit misinformation in a time of crisis. The CORD-19[9] coronavirus-related literature corpus, primarily from PubMed and mostly published in 2020, has quickly generated a number of data science and computing works[11]. These cover topics from IR to natural language processing (NLP), including applications in question answering[12], text summarization, and document search[10]. SLEDGE[16] extends this by using SciBERT[17] (the scientific text-trained version of the prominent BERT[18] NLP model), fine-tuned on MS MARCO, to re-rank articles retrieved with BM25.

RESULTS
Esteva et al 3
Evaluation
DISCUSSION
RqC(d) + k