Comparative Analysis of Word Embeddings in Assessing Semantic Similarity of Complex Sentences

Dhivya Chandrasekaran, Vijay Mago

doi:10.1109/access.2021.3135807

Abstract

Semantic textual similarity is one of the open research challenges in the field of Natural Language Processing. Extensive research has been carried out in this field, and recent transformer-based models achieve near-perfect results on existing benchmark datasets like the STS dataset and the SICK dataset. In this paper, we study the sentences in these datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences. We build a complex sentences dataset comprising 50 sentence pairs with associated semantic similarity values provided by 15 human annotators. Readability analysis is performed to highlight the increase in complexity between the sentences in the existing benchmark datasets and those in the proposed dataset. Further, we perform a comparative analysis of the performance of various word embeddings and language models on the existing benchmark datasets and the proposed dataset. The results show that the increase in complexity of the sentences has a significant impact on the performance of the embedding models, resulting in a 10-20% decrease in Pearson's and Spearman's correlation.
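The abstract reports performance as Pearson's and Spearman's correlation between model similarity scores and human annotations. The following is a minimal sketch of how such correlations might be computed, using only the Python standard library. The score lists are invented toy values for illustration, not data from the paper, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (assumed for illustration): human similarity ratings and
# similarity scores from a hypothetical embedding model for 5 sentence pairs.
human_scores = [0.5, 1.8, 2.4, 3.9, 4.6]
model_scores = [0.31, 0.35, 0.62, 0.58, 0.91]

print(f"Pearson:  {pearson(human_scores, model_scores):.3f}")
print(f"Spearman: {spearman(human_scores, model_scores):.3f}")
```

Pearson's correlation is sensitive to the actual score values, while Spearman's depends only on their ordering, which is one reason evaluations of this kind commonly report both.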

Highlights

One of the core components of Natural Language Processing (NLP) is assessing the semantic similarity between text data.
On analyzing the complexity of these sentences using readability indices, we find that 70% of sentences in the semantic textual similarity (STS) dataset and 90% of sentences in the SICK dataset have an aggregate readability score below 10, while only 25% of the sentences in the proposed dataset score below 10.
Various word embedding models have been proposed over the years to capture the semantics of words in numeric representations.
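The readability comparison in the highlights (the share of sentences scoring below 10) can be illustrated with a small sketch. This section does not name the readability indices the authors aggregated, so the example below assumes a single well-known index, the Automated Readability Index (ARI), as a stand-in. The tokenization rules and the two sample sentences are likewise assumptions made for illustration.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Assumed tokenization: sentences end at '.', '!' or '?';
    # words are runs of letters, digits, or apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def share_below(texts, threshold=10.0):
    """Fraction of texts whose readability score is below the threshold."""
    scores = [automated_readability_index(t) for t in texts]
    return sum(score < threshold for score in scores) / len(scores)


# Illustrative sentences (not taken from the datasets discussed here).
simple_sentence = "A man is playing a guitar."
complex_sentence = (
    "Semantic textual similarity is one of the open research challenges "
    "in the field of Natural Language Processing."
)

print(automated_readability_index(simple_sentence))
print(automated_readability_index(complex_sentence))
print(share_below([simple_sentence, complex_sentence]))
```

A higher score indicates text that is harder to read, so a dataset in which few sentences fall below the threshold is, by this measure, more complex.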

Summary

INTRODUCTION

One of the core components of Natural Language Processing (NLP) is assessing the semantic similarity between text data. Word embeddings like word2vec [9] and GloVe [10] exploit the principle of the distributional hypothesis [11], i.e., “similar words occur in similar context”. These methods use advances in deep learning techniques to capture the semantics of words from large text corpora. Over the years, various benchmark datasets have been used to compare the performance of models in measuring semantic similarity between text data. A comparative analysis of various existing text embeddings is performed on two existing benchmark datasets and the proposed complex sentence dataset. ... the impact of complexity of sentences on the performance of text embedding models.
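To make the idea of measuring sentence similarity with word embeddings concrete, here is a minimal sketch of one common baseline: average the word vectors of each sentence and compare the averages with cosine similarity. This is not necessarily the procedure used in the paper. The three-dimensional vectors below are invented toy values, whereas real word2vec or GloVe vectors are learned from large corpora and have hundreds of dimensions.

```python
from math import sqrt

# Hypothetical toy embeddings, assumed purely for illustration.
TOY_VECTORS = {
    "cat": [0.90, 0.10, 0.00],
    "kitten": [0.85, 0.15, 0.05],
    "sleeps": [0.10, 0.80, 0.10],
    "naps": [0.15, 0.75, 0.10],
    "stock": [0.00, 0.10, 0.90],
    "falls": [0.10, 0.20, 0.80],
}


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence."""
    tokens = [t for t in sentence.lower().split() if t in TOY_VECTORS]
    if not tokens:
        raise ValueError("no known words in sentence")
    dim = len(TOY_VECTORS[tokens[0]])
    return [sum(TOY_VECTORS[t][d] for t in tokens) / len(tokens) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity(sentence_a, sentence_b):
    """Similarity of two sentences via their averaged word vectors."""
    return cosine(sentence_vector(sentence_a), sentence_vector(sentence_b))


print(similarity("cat sleeps", "kitten naps"))
print(similarity("cat sleeps", "stock falls"))
```

With these toy vectors, the first pair scores higher than the second because words that would appear in similar contexts were given nearby vectors, which is the behavior the distributional hypothesis is meant to capture.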

RELATED WORK

METHODOLOGY

HUMAN ANNOTATION

RESULTS AND DISCUSSION

CONCLUSION

Findings

24 A system for storing and taking care of data

Journal: IEEE Access	Publication Date: Jan 1, 2021
Citations: 4	License type: CC BY 4.0

Similar Papers

A BERT-GRU Model for Measuring the Similarity of Arabic Text
Rakia Saidi ... Didier Schwab
JUCS - Journal of Universal Computer Science | VOL. 30
28 Jun 2024

Semantic Similarity of Arabic Sentences with Word Embeddings
El Moatez Billah Nagoudi ... Didier Schwab
01 Jan 2017

Cross-Lingual Semantic Textual Similarity Modeling Using Neural Networks
Xia Li ... Minping Chen
01 Jan 2019

A Comparison of Approaches for Measuring the Semantic Similarity of Short Texts Based on Word Embeddings
Karlo Babić ... Francesco Guerra
Journal of information and organizational sciences | VOL. 44
09 Dec 2020