Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval.

Payam Karisani,Zhaohui S Qin,Eugene Agichtein

doi:10.1093/database/bax104

Open Access

https://doi.org/10.1093/database/bax104

Abstract

The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system.

Database URL: https://github.com/emory-irlab/biocaddie

Highlights

With rapid technological developments such as DNA sequencing and brain imaging, ever-increasing volumes of massive datasets have been produced.
The results indicate that, given the verbose queries and in the presence of an effective keyword detection method, we are unable to gain a significant benefit from the ‘Blind Relevance Feedback’ (BRF) expansion method.
The results show that applying learning to rank (LTR) in the scarce training data environment causes overfitting, and the final model causes a 5.1% degradation in NDCG compared to the IROpt system. (We observe some difference between the IROpt models in Tables 4 and 5 because, as mentioned in the section Experimental setup, for the LTR part we fixed all the information retrieval parameters in Tables 2 and 3 and assumed that there is a universal tuned parameter setting that can be used in the domain.)
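
For readers unfamiliar with the metric named in the last highlight, the following is a minimal sketch of how NDCG is commonly computed. It is not the challenge's official evaluation code. It assumes graded relevance labels, the exponential gain 2^rel - 1, and an ideal ranking built only from the labels passed in (a full evaluation would build the ideal ranking from all judged datasets for the query). The function names `dcg` and `ndcg` are illustrative.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    # Assumption: exponential gain (2^rel - 1) with a log2(rank + 1) discount.
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal reordering."""
    # Assumption: the ideal ranking is a re-sort of the same labels.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical labels for one query: 2 = relevant, 1 = partial, 0 = not relevant.
    print(ndcg([1, 2, 0, 0, 2], k=5))
```

A relative degradation such as the one reported above would then be the difference between two systems' mean NDCG scores divided by the score of the baseline system.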

Summary

Introduction

Background and motivation

With rapid technological developments such as DNA sequencing and brain imaging, ever-increasing volumes of massive datasets have been produced. The NCBI Gene Expression Omnibus has to date (November 2017) archived >91 000 experimental studies, which comprise >2 million samples. Such massive amounts of openly accessible data offer unprecedented opportunities to advance our understanding of biology, human health and diseases. In a perspective article that describes NIH’s vision of Big Data to Knowledge (BD2K) [1], Margolis et al. pointed out that ‘A fundamental question for BD2K is how to enable the identification, access and citation of (i.e. credit for) biomedical data.’ In Eric Green’s presentation on ‘NIH and Biomedical “Big Data”’, the first of the ‘major problems to solve’ for big data is ‘Locating the data.’ This is the challenge on which we focus in this paper: developing and evaluating techniques for finding relevant biomedical datasets.

Journal: Database : the journal of biological databases and curation
Publication Date: Jan 1, 2018
Citations: 5
License type: CC BY 4.0
