Neural sentence embedding models for semantic similarity estimation in the biomedical domain

Kathrin Blagec, Hong Xu, Asan Agibetov, Matthias Samwald

doi:10.1186/s12859-019-2789-2

Open Access

Abstract

Background: Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set.

Results: Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor.

Conclusions: In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
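
The evaluation described in the abstract compares model similarity scores with human annotations using Pearson's r. A minimal sketch of that kind of comparison is given below. It is not the authors' code: the three-dimensional embeddings and the human ratings are hypothetical toy values, and cosine similarity is assumed here as the way to score a sentence pair from its two embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical toy data: (embedding of sentence 1, embedding of sentence 2,
# human similarity rating). Real embeddings would come from a trained model.
pairs = [
    ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1], 3.8),
    ([0.1, 0.9, 0.2], [0.7, 0.1, 0.6], 1.0),
    ([0.3, 0.3, 0.9], [0.2, 0.4, 0.8], 3.5),
    ([0.5, 0.5, 0.0], [0.0, 0.1, 0.9], 0.4),
]

model_scores = [cosine(a, b) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

r = pearson(model_scores, human_scores)
print(f"Pearson r between model and human scores: {r:.3f}")
```

A higher r means the model ranks and spaces the sentence pairs more like the annotators do; it says nothing about whether the absolute score values match.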

Highlights

Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space.
While knowledge-based measures have previously been shown to be more effective for semantic similarity estimation in the biomedical field, they are dependent on the availability of domain-specific ontologies, whose creation – despite the emergence of automatic and semi-automatic ontology learning – still remains a tedious, work-intensive and error-prone task [1].
We investigate the usefulness of current state-of-the-art neural sentence embedding models for semantic similarity estimation in the biomedical domain.

Summary

Introduction

Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. [...] that are able to accurately capture and quantify their semantic relatedness. Such semantic measures can be broadly divided into two categories: distributional and knowledge-based metrics, depending on whether they use corpora of texts or ontologies as proxies, respectively. Boosted by advances in hardware technology that allow fast processing of large amounts of text data, such neural network-based methods for embedding words, sentences or even larger text elements in low-dimensional vector space have recently caught attention for their ability to effectively capture semantic information [3, 4].
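
To make the distinction between distributional and knowledge-based measures concrete, the following sketch contrasts the two on toy data. Everything in it is an illustrative assumption rather than material from the paper: the three-sentence corpus, the small is-a hierarchy, the use of raw co-occurrence counts as a distributional representation, and the simple path-based score 1 / (1 + path length).

```python
import math
from collections import Counter

# --- Distributional: similarity derived from a text corpus ---------------
# Hypothetical toy corpus; a real model would use millions of sentences.
corpus = [
    "aspirin reduces fever",
    "ibuprofen reduces fever",
    "insulin lowers glucose",
]

def context_vector(word, sentences):
    """Count the words that co-occur with `word` in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def distributional_similarity(a, b):
    return cosine(context_vector(a, corpus), context_vector(b, corpus))

# --- Knowledge-based: similarity derived from an ontology ----------------
# Hypothetical toy is-a hierarchy (child -> parent), not a real ontology.
parent = {
    "aspirin": "nsaid",
    "ibuprofen": "nsaid",
    "nsaid": "substance",
    "insulin": "hormone",
    "hormone": "substance",
}

def ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path

def path_similarity(a, b):
    """1 / (1 + number of edges between a and b via their closest common ancestor)."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = [t for t in path_a if t in path_b]
    if not common:
        return 0.0
    closest = common[0]
    distance = path_a.index(closest) + path_b.index(closest)
    return 1.0 / (1.0 + distance)

for a, b in [("aspirin", "ibuprofen"), ("aspirin", "insulin")]:
    print(a, b,
          round(distributional_similarity(a, b), 3),
          round(path_similarity(a, b), 3))
```

The distributional score needs only text, whereas the path-based score cannot be computed for any term missing from the hierarchy, which is the dependency on curated ontologies that the text describes.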

Journal: BMC Bioinformatics (2019) 20:178
Publication Date: Apr 11, 2019
Citations: 21
License type: open-access
