Abstract
Semantic textual similarity is one of the open research challenges in the field of Natural Language Processing. Extensive research has been carried out in this field, and near-perfect results are achieved by recent transformer-based models on existing benchmark datasets such as the STS dataset and the SICK dataset. In this paper, we study the sentences in these datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences. We build a complex-sentences dataset comprising 50 sentence pairs with associated semantic similarity values provided by 15 human annotators. Readability analysis is performed to highlight the increase in complexity from the sentences in the existing benchmark datasets to those in the proposed dataset. Further, we perform a comparative analysis of the performance of various word embeddings and language models on the existing benchmark datasets and the proposed dataset. The results show that the increased complexity of the sentences has a significant impact on the performance of the embedding models, resulting in a 10-20% decrease in Pearson's and Spearman's correlations.
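The paper does not reproduce its evaluation code, but the protocol the abstract describes is the standard one: score each sentence pair by the cosine similarity of its embeddings, then correlate those scores with the human similarity judgments. Below is a minimal sketch of that protocol; `embed_sentence` is a hypothetical stand-in for any of the embedding models being compared.

```python
# Sketch of the standard STS evaluation protocol (not the paper's own code).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(pairs, human_scores, embed_sentence):
    """Correlate model similarity scores with human judgments.

    pairs          -- list of (sentence_a, sentence_b) tuples
    human_scores   -- gold similarity value for each pair
    embed_sentence -- hypothetical callable mapping a sentence to a vector
    """
    model_scores = [cosine(embed_sentence(a), embed_sentence(b)) for a, b in pairs]
    return pearsonr(model_scores, human_scores)[0], spearmanr(model_scores, human_scores)[0]
```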
Highlights
One of the core components of Natural Language Processing (NLP) is assessing the semantic similarity between text data
On analyzing the complexity of these sentences using the aforementioned readability indices, we find that 70% of sentences in the semantic textual similarity (STS) dataset and 90% of sentences in the SICK dataset have an aggregate readability score below 10, while only 25% of the sentences in the proposed dataset fall below that score (see the readability sketch after this list)
Various word embedding models have been proposed over the years to capture the semantics of the words in numeric representations
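As a hedged sketch of the readability analysis behind the highlight above, the snippet below averages several grade-level readability indices using the third-party `textstat` package. Treating the mean of these indices as the "aggregate readability score" is an assumption; the paper does not state its exact aggregation formula.

```python
import textstat

# Grade-level readability indices; averaging them is an assumed aggregation.
GRADE_LEVEL_INDICES = (
    textstat.flesch_kincaid_grade,
    textstat.gunning_fog,
    textstat.smog_index,
    textstat.automated_readability_index,
)

def aggregate_readability(sentence: str) -> float:
    """Mean of several grade-level readability indices for one sentence."""
    return sum(index(sentence) for index in GRADE_LEVEL_INDICES) / len(GRADE_LEVEL_INDICES)

# A score below 10 corresponds roughly to a pre-high-school reading level.
print(aggregate_readability("A man is playing a guitar."))
```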
Summary
One of the core components of Natural Language Processing (NLP) is assessing the semantic similarity between text data. Word embeddings like word2vec [9] and GloVe [10] exploit the principle of the distributional hypothesis [11], i.e., "similar words occur in similar contexts". These methods use advances in deep learning techniques to capture the semantics of words from large text corpora. Over the years, various benchmark datasets have been used for comparing the performance of models in measuring semantic similarity between text data. A comparative analysis of various existing text embeddings is performed on two existing benchmark datasets and the proposed complex-sentence dataset to study the impact of sentence complexity on the performance of text embedding models.
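The distributional hypothesis the summary cites can be illustrated directly with pretrained vectors: words that occur in similar contexts end up close in the embedding space. The sketch below uses gensim's downloader with its hosted "glove-wiki-gigaword-100" archive; any word2vec or GloVe KeyedVectors model exposes the same interface.

```python
import gensim.downloader as api

# Downloads ~130 MB of pretrained GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-100")

# Words sharing contexts score high; unrelated words score low.
print(vectors.similarity("guitar", "piano"))    # high: both occur in music contexts
print(vectors.similarity("guitar", "algebra"))  # low: rarely share contexts
print(vectors.most_similar("guitar", topn=3))   # nearest neighbors in vector space
```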