Abstract

The similarity between words provides significant support to many tasks in natural language processing. Several works use lexical resources such as WordNet for semantic similarity and synonym identification. Nevertheless, out-of-vocabulary words and missing links between senses are known problems of this approach. Distributional proposals such as word embeddings have successfully been used to address these problems, but the lack of contextual information can prevent even better results. Distributional models that include contextual information can bring advantages to this area, but such models are still scarcely explored. Therefore, this work studies the advantages of incorporating syntactic information into distributional models, in pursuit of better results in semantic similarity approaches. For that purpose, the current work explores existing lexical and distributional techniques for measuring word similarity in Brazilian Portuguese. Experiments were carried out with the lexical database WordNet, using different techniques over a standard dataset. The results indicate that word embeddings can cover out-of-vocabulary words and achieve better results than lexical approaches. The main contribution of this article is a new approach that applies syntactic context in the training process of word embeddings for a Brazilian Portuguese corpus. The comparison of this model with the outcome of the previous experiments shows sound results and reveals relevant complementary aspects.
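
To make the contrast concrete, the sketch below illustrates the two families of measures compared in this work: a lexical measure computed over WordNet and a distributional measure computed as the cosine similarity between pre-trained word embeddings. The article does not specify its tooling; this sketch assumes NLTK's WordNet interface with the Open Multilingual WordNet for Portuguese and gensim for the vectors, and the vector file name and example word pair are hypothetical placeholders.

# Minimal sketch of the two similarity families compared in this work.
# Requires: nltk.download('wordnet'), nltk.download('omw-1.4'), and a
# pre-trained Portuguese vector file ("pt_vectors.txt" is a placeholder).
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

def wordnet_similarity(w1, w2, lang="por"):
    # Best path similarity over all sense pairs; None when a word is
    # out of vocabulary or no path connects the senses.
    pairs = [(a, b) for a in wn.synsets(w1, lang=lang)
                    for b in wn.synsets(w2, lang=lang)]
    scores = [a.path_similarity(b) for a, b in pairs]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None

vectors = KeyedVectors.load_word2vec_format("pt_vectors.txt")  # placeholder path

def embedding_similarity(w1, w2):
    # Cosine similarity between embedding vectors; None when a word is missing.
    if w1 in vectors and w2 in vectors:
        return float(vectors.similarity(w1, w2))
    return None

print(wordnet_similarity("carro", "automóvel"))    # lexical (WordNet) measure
print(embedding_similarity("carro", "automóvel"))  # distributional measure

The None returns make the out-of-vocabulary behavior discussed in the abstract explicit: the lexical measure fails on words absent from WordNet, while the embedding measure fails only on words absent from the training corpus.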

Highlights

  • The identification of semantic similarity between words is a research subject that has been actively explored in recent years (Pawar and Mago, 2018)

  • The importance of identifying word similarity is most apparent in natural language processing tasks such as summarization, information extraction, information retrieval, and question answering

  • Based on the results presented by Granada et al. (2014), the current work adopted PT65 as a gold standard for evaluating semantic similarity and relatedness between words with word embedding models and WordNet (see the evaluation sketch after this list)
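
The sketch below shows, under assumed file formats, how a similarity model can be scored against a gold standard such as PT65: Spearman rank correlation between human judgments and model scores. The file name "pt65.csv" and its three-column layout (word1, word2, human score) are assumptions for the sake of the example, not the dataset's actual distribution format.

# Hedged evaluation sketch: rank-correlate model similarities with human
# judgments from a gold-standard word-pair list (file layout is assumed).
import csv
from scipy.stats import spearmanr

def evaluate(model_similarity, gold_path="pt65.csv"):
    human, model = [], []
    with open(gold_path, newline="", encoding="utf-8") as f:
        for w1, w2, score in csv.reader(f):
            sim = model_similarity(w1, w2)
            if sim is not None:           # skip out-of-vocabulary pairs
                human.append(float(score))
                model.append(sim)
    rho, _ = spearmanr(human, model)
    return rho, len(human)                # correlation and pairs covered

Coverage (the number of pairs actually scored) matters as much as the correlation itself, since the abstract notes that embeddings cover out-of-vocabulary words that lexical approaches miss.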



Introduction

The identification of semantic similarity between words is a research subject that has been actively explored in recent years (Pawar and Mago, 2018). The importance of identifying word similarity is most apparent in natural language processing tasks such as summarization, information extraction, information retrieval, and question answering. In those tasks, the search for information in texts demands that systems find the correct results despite the flexibility inherent in natural language. This can be observed, for instance, when the texts do not contain the same words used in the search query but do contain similar words, such as synonyms. It can also be observed in the opposite situation, when the system needs to deal with polysemic words, whose single form carries several meanings (Jurafsky and Martin, 2009).
