Abstract

Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participant’s best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system’s performance increasing our baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data

Highlights

  • The use of search engines in the Web as content indexing and as data providers, via e.g., hyperlinks and text snippets, such as in Google and PubMed, has changed the way digital libraries are managed from static catalogues and databases to dynamically changing collections in a highly distributed environment

  • To improve search of biomedical research datasets, in this work we investigate a novel dataset ranking pipeline that goes beyond the use of metadata available in DataMed

  • The biomedical community is increasingly developing data management systems based on advanced text analytics to cope with extremely large size and variety of datasets involved in scientific research [6,7,8,9,10,39]

Read more

Summary

Introduction

The use of search engines in the Web as content indexing and as data providers, via e.g., hyperlinks and text snippets, such as in Google and PubMed, has changed the way digital libraries are managed from static catalogues and databases to dynamically changing collections in a highly distributed environment. Systems, such as PubMed, PubMed Central (PMC) and Europe PMC, successfully provide platforms for retrieving and accessing information in the scientific literature, an essential step for the progress of biomedical sciences. Providing integrated access to research datasets fosters collaborations and speeds up scientific progress, and, with the advance in data analytics methods, allows the discovery of new knowledge from connected data [5]

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call