Protocol for a reproducible experimental survey on biomedical sentence similarity.

Alicia Lara-Clares, Juan J Lastra-Díaz, Ana Garcia-Serrano

doi:10.1371/journal.pone.0248663

Open Access

https://doi.org/10.1371/journal.pone.0248663

Journal: PloS one	Publication Date: Mar 24, 2021
Citations: 2	License type: CC BY 4.0

Affiliation: National University of Distance Education

Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted considerable attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for several reasons: the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup, among others. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly established. In addition, there are other significant gaps in the literature on biomedical sentence similarity: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Having identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date, and, for the first time, reproducible experimental survey on biomedical sentence similarity.
This experimental survey will be based on our own software replication and evaluation of all methods under study on a single software platform, which will be developed specifically for this work and will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.

Highlights

Our main contributions are as follows: (1) the largest and, for the first time, reproducible experimental survey on biomedical sentence similarity; (2) the first collection of self-contained and reproducible benchmarks on biomedical sentence similarity; (3) the evaluation of a set of previously unexplored methods, as well as the evaluation of a new word embedding model based on FastText and trained on the full text of articles in the PubMed Central (PMC)-BioC corpus [19]; (4) the integration, for the first time, of most sentence similarity methods for the biomedical domain in the same software library, called HESML-Short Text Similarity (STS); and (5) a detailed reproducibility protocol together with a collection of software tools and datasets, which will be provided as supplementary material to allow the exact replication of all our experiments and results.
Sogancioglu et al. [20] proposed a set of ontology-based measures called WordNet-based Similarity Measure (WBSM) and Unified Medical Language System (UMLS)-based Similarity Measure (UBSM), which are based on the Li et al. [21] measure.
Detailed setup of each method: Contextual string embeddings trained on PubMed; Skip-gram trained on PubMed + PMC; Skip-gram WE model trained on PubMed using word2vec program; Continuous Bag of Words (CBOW) WE model trained on PubMed using word2vec program; Skip-gram WE model trained on PubMed using word2vec program; CBOW WE model trained on PubMed using word2vec program; GloVe WE model trained on PubMed; GloVe WE model trained on PubMed; FastText.
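
As an illustration of how a word embedding (WE) model such as those listed above can be turned into a sentence similarity score, the following minimal sketch averages the word vectors of each sentence and compares the two averages with cosine similarity. This is a common baseline and is not claimed to be the aggregation used by any method in the survey. The vocabulary, the 3-dimensional vector values, and the function names are hypothetical and chosen only to keep the example self-contained; a real experiment would load a model pre-trained on PubMed or PMC instead.

```python
import math

# Toy 3-dimensional word vectors. These values are hypothetical and stand in
# for a pre-trained model (e.g. Skip-gram, CBOW, GloVe, or FastText vectors).
TOY_VECTORS = {
    "protein": [0.9, 0.1, 0.0],
    "gene": [0.8, 0.2, 0.1],
    "binds": [0.1, 0.9, 0.0],
    "regulates": [0.2, 0.8, 0.1],
    "dna": [0.7, 0.0, 0.3],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, vectors=TOY_VECTORS):
    """Cosine similarity between the averaged word vectors of two sentences."""
    v1 = sentence_vector(s1, vectors)
    v2 = sentence_vector(s2, vectors)
    if v1 is None or v2 is None:
        return 0.0  # assumption: sentences with no known token score zero
    return cosine(v1, v2)


if __name__ == "__main__":
    print(sentence_similarity("protein binds dna", "gene regulates dna"))
    print(sentence_similarity("protein", "binds"))
```

Whitespace tokenization and lowercasing are the only pre-processing steps here; the choice of tokenizer, stop-word handling, and NER tool is exactly the kind of setup detail whose impact the protocol proposes to study.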

Summary

Introduction

Measuring semantic similarity between sentences is an important task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining, among others. Our main contributions are as follows: (1) the largest and, for the first time, reproducible experimental survey on biomedical sentence similarity; (2) the first collection of self-contained and reproducible benchmarks on biomedical sentence similarity; (3) the evaluation of a set of previously unexplored methods, as well as the evaluation of a new word embedding model based on FastText and trained on the full text of articles in the PMC-BioC corpus [19]; (4) the integration, for the first time, of most sentence similarity methods for the biomedical domain in the same software library, called HESML-STS; and (5) a detailed reproducibility protocol together with a collection of software tools and datasets, which will be provided as supplementary material to allow the exact replication of all our experiments and results.

Methods on sentence semantic similarity

Literature review methodology

Methods proposed for the biomedical domain

Evaluation metrics

Methods

Conclusions and future work
