Abstract

In this work, we compare and analyze a variety of approaches to medical publication retrieval and, in particular, to the Technology Assisted Review (TAR) task. This task consists of collecting the articles that summarize all the evidence published on a given medical topic, and it requires long search sessions by experts in the field of medicine. For this reason, semi-automatic approaches are essential for supporting these searches when the amount of data exceeds what users can handle. In this paper, we use state-of-the-art models and weighting schemes with different types of preprocessing, as well as query expansion (QE) and relevance feedback (RF) approaches, in order to find the best combination for this particular task. We also tested word-embedding representations of documents and queries, in addition to three different ranking fusion approaches, to see whether the merged runs perform better than the single models. To make our results reproducible, we used the collection provided by the Conference and Labs of the Evaluation Forum (CLEF) eHealth tasks. Query expansion and relevance feedback greatly improve performance, while the fusion of different rankings does not perform well in this task. The statistical analysis showed that, in general, the performance of the system depends less on the type of text preprocessing than on which weighting scheme is applied.

Highlights

  • Information Retrieval (IR) is a research area that has seen massive growth in interest together with the growth of the internet

  • Fusion was not shown to be a clearly better approach in either case: across all the results, it improves the retrieved list of documents by less than 2% in terms of precision and Normalized Discounted Cumulative Gain (NDCG), which we do not consider sufficient to make it the preferred approach

  • We focused on the retrieval of medical publications, using the Conference and Labs of the Evaluation Forum (CLEF) eHealth tracks as our tasks


Summary

Introduction

Information Retrieval (IR) is a research area in which interest has grown massively alongside the growth of the internet. As computer performance improved and ways to store and retrieve data, such as databases, were developed, IR became increasingly important in computer science, culminating with the explosion of the World Wide Web. To be relevant to this query, a document should contain these three words at least once; otherwise, it is unlikely to be fully relevant for the user. This approach, proposed by Luhn [6], is very simple yet very powerful. If a document is longer than a few paragraphs, some words will have a very high count without being informative about its topic. These words are, for example, prepositions, which do not distinguish one document from another because both contain the same words at high frequency. The resolving power of a word is its ability to identify a document and distinguish it from another.
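The point about high-count but uninformative words can be illustrated with a minimal sketch. It is not taken from the paper: the two example texts, the tokenization rule, and the function names `term_counts`, `shared_frequent_words`, and `distinctive_words` are all assumptions made for illustration.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on non-letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def shared_frequent_words(doc_a, doc_b, top_k=2):
    """Return the words that are among the top_k most frequent in BOTH documents."""
    top_a = {word for word, _ in term_counts(doc_a).most_common(top_k)}
    top_b = {word for word, _ in term_counts(doc_b).most_common(top_k)}
    return top_a & top_b


def distinctive_words(doc_a, doc_b):
    """Return the words that occur in doc_a but never in doc_b."""
    return set(term_counts(doc_a)) - set(term_counts(doc_b))


# Hypothetical documents on two unrelated topics (illustrative only).
doc_a = "the review of the trials of the drug and the treatment of the disease"
doc_b = "the history of the city and the growth of the port of the region"

# The most frequent words are the same in both texts, so they cannot
# tell the two documents apart.
print(shared_frequent_words(doc_a, doc_b))

# The words that do separate the documents each occur only once.
print(distinctive_words(doc_a, doc_b))
```

In this toy case the highest counts belong to function words shared by both texts, while the words that identify each topic are rare. This is the intuition behind resolving power: raw frequency alone is a poor indicator of how well a word distinguishes a document.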

