Word Sense Induction in Persian and English: A Comparative Study

Masood Ghayoomi

doi:10.52547/jist.9.36.263

Abstract

Words in natural language have forms and meanings, and there is not always a one-to-one match between them. As a result, a word can have more than one meaning, and a text processing system faces challenges in determining the precise meaning of a target word in a sentence. Lexical resources or lexical databases, such as WordNet, can help, but because they are developed manually, they become outdated with the passage of time and language change. Moreover, lexical resources might be domain dependent, which makes them unusable for open-domain natural language processing tasks. These drawbacks are a strong motivation to use unsupervised machine learning approaches to induce word senses from natural data. To reach this goal, a clustering approach can be utilized such that each cluster resembles a sense. In this paper, we study the performance of a word sense induction model with respect to three variables: a) the target language: in our experiments, we run the induction process on Persian and English; b) the type of clustering algorithm: both parametric clustering algorithms, including hierarchical and partitioning, and non-parametric clustering algorithms, including probabilistic and density-based, are utilized to induce senses; c) the context of the target word used to capture the information in the vectors created for clustering: the input vectors of the clustering algorithms are created based either on the whole sentence in which the target word is located or on a limited window of words surrounding the target word. We evaluate the clustering performance externally. Moreover, we introduce a normalized, joint evaluation metric to compare the models. The experimental results for both Persian and English test data showed that the window-based partitioning K-means algorithm obtained the best performance.
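The two context settings named in item (c) can be illustrated with a minimal sketch. It shows only which tokens would feed the vector for one target word, not how the vectors themselves are built. The function names, the whitespace tokenization, and the default window size of 2 are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List


def sentence_context(tokens: List[str], target_index: int) -> List[str]:
    """Sentence-based context: every token in the sentence except the target."""
    return [tok for i, tok in enumerate(tokens) if i != target_index]


def window_context(tokens: List[str], target_index: int, window_size: int = 2) -> List[str]:
    """Window-based context: up to `window_size` tokens on each side of the target.

    The window size is a hypothetical default chosen for this sketch.
    """
    start = max(0, target_index - window_size)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window_size]
    return left + right


# Hypothetical example sentence with the ambiguous target word "bank".
tokens = "I deposited cash at the bank near the river".split()
target_index = tokens.index("bank")

full = sentence_context(tokens, target_index)
local = window_context(tokens, target_index, window_size=2)
```

In this toy sentence the window keeps only the four tokens adjacent to "bank", while the sentence-based context keeps all eight remaining tokens.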

Highlights

Language, as a means of communication between human beings, is composed of two components [1]: form and meaning
To evaluate the performance of the clustering algorithms, we use two naïve baselines introduced in SemEval-2010 [40]: a) the Most Frequent Sense (MFS) baseline, where all instances are assigned to a single cluster that contains the most frequent sense; b) one sense per cluster, hereafter called 1S1C, where each instance is assigned to its own cluster, so the number of clusters is equal to the number of instances
There are two state-of-the-art (SOTA) results reported in the literature: a) the Chinese Restaurant Process (CRP) algorithm utilized by Li and Jurafsky [30] for non-parametric clustering; and b) the K-means algorithm proposed by Neelakantan et al. [29] for parametric clustering
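The two naïve baselines described above can be sketched as cluster assignments over the instances of one target word. This is a minimal illustration, assuming that clusters are represented by integer labels; the function names are hypothetical and do not come from the SemEval-2010 tooling.

```python
from typing import List, Sequence


def mfs_baseline(instances: Sequence[str]) -> List[int]:
    """MFS: assign every instance of the target word to one single cluster."""
    return [0] * len(instances)


def one_sense_per_cluster_baseline(instances: Sequence[str]) -> List[int]:
    """1S1C: give each instance its own cluster, so #clusters == #instances."""
    return list(range(len(instances)))


# Hypothetical instances (contexts) of the ambiguous word "bank".
instances = [
    "she sat on the bank of the river",
    "the bank approved the loan",
    "he robbed a bank downtown",
]

mfs_labels = mfs_baseline(instances)
one_s_one_c_labels = one_sense_per_cluster_baseline(instances)
```

The two baselines sit at opposite extremes: MFS yields one cluster regardless of the data, and 1S1C yields as many clusters as there are instances.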

Summary

Introduction

Language, as a means of communication between human beings, is composed of two components [1]: form and meaning. The 'form' can be represented either via an audio signal transmitted through a voice channel from a speaker to a recipient, or via an orthographic form through the writing system and the alphabet of the language. The orthographic form of the language is the one taken into consideration here. Ambiguity is a property of natural language that causes challenges in text processing. There are two types of ambiguity: a) syntactic ambiguity, and b) lexical ambiguity. The sentence 'I saw the man with a telescope.', for instance, is an example of syntactic ambiguity: it can mean either 'I used a telescope to see the man' or 'I saw the man who carried a telescope'.

Methods

Results

Conclusion