Abstract

Word relatedness computation is an important supporting technology for many tasks in natural language processing. Traditionally, there have been two distinct strategies for measuring word relatedness: one uses corpus-based models, while the other leverages external lexical resources. Each solution, however, has its own strengths and weaknesses. In this paper, we propose a lexical resource-constrained topic model that integrates the two complementary strategies effectively. Our model is an extension of probabilistic latent semantic analysis (PLSA) that automatically learns word-level distributed representations for word relatedness measurement. Furthermore, we introduce a generalized expectation maximization (GEM) algorithm for statistical estimation. The proposed model not only inherits the dimension-reduction advantage of conventional topic models, but also refines parameter estimation by using word pairs that are known to be related. Experimental results in different languages demonstrate the effectiveness of our model in topic extraction and word relatedness measurement.
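
For orientation, the following is a minimal sketch of the plain PLSA building block that the paper extends: EM alternately computes topic responsibilities P(z|d,w) from the current parameters and then re-estimates P(z|d) and P(w|z) from expected counts. The lexical-resource constraints and the GEM-based update described in the abstract are not reproduced here; the function name, toy dimensions, and iteration count are our own assumptions.

```python
# A minimal sketch of plain PLSA fitted with standard EM (Hofmann-style).
# The paper's lexical constraints and GEM step are NOT reproduced here;
# this is only the unconstrained baseline the model builds on.
import numpy as np

def plsa(counts, n_topics, n_iters=50, seed=0):
    """counts: (n_docs, n_words) term-frequency matrix (toy assumption)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization of P(z|d) and P(w|z), each normalized
    # over its probability axis.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # shape (d, z, w)
        p_z_dw = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate both factors from expected counts
        # n(d,w) * P(z|d,w).  (The paper replaces this exact M-step with
        # a GEM step that also honors known-related word pairs.)
        expected = counts[:, None, :] * p_z_dw             # shape (d, z, w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```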

Highlights

  • Semantic relatedness computation between two words is of great importance in the field of Natural Language Processing (NLP) [1]

  • By introducing lexical resources as constraints into the conventional Probabilistic Latent Semantic Analysis (PLSA) model, we extend PLSA to a lexical resource-constrained variant for word relatedness measurement

  • We applied the cosine distance function to compute the relatedness of each word pair in the test sets, using the distributed word representations generated by PLSA (see the sketch after this list)

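As a hedged illustration of the highlighted relatedness step, the sketch below represents each word by a topic-space vector taken from the fitted P(w|z) matrix and scores a word pair with cosine similarity. The exact word representation used in the paper is not spelled out on this page, so the normalized-column choice (equivalent to P(z|w) under a uniform topic prior) and the word ids are assumptions.

```python
# A hedged sketch of relatedness scoring over PLSA output.
# word_vector() assumes the representation is the normalized column of
# P(w|z), i.e. P(z|w) up to a uniform P(z); this is our assumption.
import numpy as np

def word_vector(p_w_z, word_id):
    # Column of P(w|z) across topics, renormalized into a distribution.
    v = p_w_z[:, word_id]
    return v / v.sum()

def cosine(u, v):
    # Cosine similarity between two topic-space vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Usage with the plsa() sketch above (toy counts, hypothetical word ids):
# _, p_w_z = plsa(counts, n_topics=10)
# relatedness = cosine(word_vector(p_w_z, 3), word_vector(p_w_z, 7))
```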

Summary

Introduction

Semantic relatedness computation between two words is of great importance in the field of Natural Language Processing (NLP) [1]. Many methods have been proposed to identify the semantic similarity of word pairs, and they can be classified mainly into the following two categories: (1) Corpus-based approaches. In this respect, approaches based on large-scale corpora can be further divided into neural and non-neural methods. Statistical approaches such as the Vector Space Model (VSM) [2], Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) are typical non-neural methods, while word embeddings such as Word2vec [3] are neural solutions. The results of these approaches rely heavily on the quantity and quality of the training corpus while ignoring external lexical resources, which generally possess high quality. (2) Lexical

