Abstract

In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive, since it requires neither the computation of a relevance feedback set nor an initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors.
Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.

Highlights

  • Biomedical data include large datasets, with diverse types of information, that are managed by a wide range of biomedical research centers

  • We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon

Summary

Introduction

Biomedical data include large datasets, with diverse types of information, that are managed by a wide range of biomedical research centers. Inspired by the evaluation framework of the bioCADDIE challenge [2] (https://biocaddie.org/biocaddie-2016-datasetretrieval-challenge), in this work we consider only the fields that we believe are the most relevant for that task. These fields include (1) title, (2) description, (3) a list of keywords, (4) a list of organisms, (5) the titles of the associated research articles, (6) the abstracts of the associated research articles, (7) a list of genes, (8) a description of a disease and (9) a description of a treatment. The rest of this paper is organized as follows: we discuss the related work; we describe the bioCADDIE challenge dataset collection; we provide an architectural overview of our solution followed by the query expansion strategy we use; we present the experimental evaluation, followed by a discussion and a summary of key observations.
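To make the idea of a multi-field query with term expansion concrete, the following is a minimal Python sketch. It is not the implementation evaluated in the paper: the field names, the per-field weights, the whitespace tokenizer, the frequency-based term selection (a simplified stand-in for Rocchio-style expansion that uses only positive feedback) and the additive scoring function are all assumptions made for illustration.

```python
from collections import Counter

# Illustrative labels for the nine fields listed above.
# The weights are hypothetical and are not values from the paper.
FIELDS = {
    "title": 2.0,
    "description": 1.0,
    "keywords": 1.5,
    "organisms": 1.0,
    "article_titles": 1.0,
    "article_abstracts": 0.5,
    "genes": 1.0,
    "disease": 1.0,
    "treatment": 1.0,
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def expand_terms(query_terms, feedback_docs, k=3):
    """Return the k most frequent terms in feedback_docs not already in the query.

    Simplified stand-in for Rocchio-style expansion: raw term counts over
    pseudo-relevant documents, with no negative feedback component.
    """
    counts = Counter()
    for doc in feedback_docs:
        counts.update(t for t in tokenize(doc) if t not in query_terms)
    # Sort by descending count, then alphabetically, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def build_multi_field_query(query, expansion_terms=()):
    """Turn a free-text query into one weighted clause per metadata field."""
    terms = tokenize(query) + list(expansion_terms)
    return {
        field: {"terms": terms, "weight": weight}
        for field, weight in FIELDS.items()
    }


def score(dataset, multi_field_query):
    """Sum weighted term matches across all fields present in the dataset."""
    total = 0.0
    for field, clause in multi_field_query.items():
        tokens = tokenize(dataset.get(field, ""))
        matches = sum(tokens.count(t) for t in clause["terms"])
        total += clause["weight"] * matches
    return total


if __name__ == "__main__":
    feedback = [
        "breast cancer tumor expression",
        "tumor expression in human breast tissue",
    ]
    extra = expand_terms(tokenize("breast cancer"), feedback, k=2)
    query = build_multi_field_query("breast cancer", extra)
    print(extra)
    print(score({"title": "Breast tumor dataset", "disease": "cancer"}, query))
```

A lexicon-based variant would replace `expand_terms` with a lookup in a term dictionary, which avoids the initial retrieval pass needed to collect feedback documents.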

