Abstract

This paper presents two systems for unsupervised clustering of morphological paradigms, in the context of the SIGMORPHON 2021 Shared Task 2. The goal of this task is to correctly cluster words in a given language by their inflectional paradigm, without any previous knowledge of the language and without supervision from labeled data of any sort. The words in a single morphological paradigm are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also usually show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Additionally, we discuss the merits of randomly initialized centroids versus pre-defined centroids for clustering. Pre-defined centroids are identified based on either a standard longest common substring algorithm or a connected graph method built on longest common substring. For all development languages, the character-based embeddings perform similarly to the baseline, and the semantic embeddings perform well below the baseline. Analysis of the systems’ errors suggests that clustering based on orthographic representations is suitable for a wide range of morphological mechanisms, particularly as part of a larger system.

Highlights

  • One significant barrier to progress in morphological analysis is the lack of available data for most of the world’s languages.

  • The SIGMORPHON 2021 shared task aims to identify morphological paradigms in an unsupervised setting, while covering languages with a wide range of morphological properties.

  • During testing, the highly connected subgraphs (HCSs) graph analysis proved computationally taxing and could not be completed in time for evaluation, though qualitative analysis of the generated longest common substring (LCS) graphs suggests the technique may still be useful given more computational power.
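To make the LCS graph idea from the last highlight concrete, here is a minimal sketch. It is not the authors' implementation: the linking rule (`min_ratio`, a hypothetical threshold on how much of the shorter word the LCS must cover) is an assumption, and plain connected components stand in for the more expensive highly connected subgraphs analysis.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-1], b[j-1].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_graph_clusters(words, min_ratio=0.6):
    """Link two words when their LCS covers at least min_ratio of the
    shorter word (an assumed rule), then return connected components."""
    words = list(words)
    neighbours = {w: set() for w in words}
    for u, v in combinations(words, 2):
        lcs = longest_common_substring(u, v)
        if len(lcs) >= min_ratio * min(len(u), len(v)):
            neighbours[u].add(v)
            neighbours[v].add(u)

    seen, clusters = set(), []
    for w in words:
        if w in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [w], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


clusters = lcs_graph_clusters(["walk", "walks", "walked", "jump", "jumps", "jumped"])
```

The pairwise comparison is quadratic in vocabulary size, which hints at why graph analysis on full word lists can become expensive. Under this assumed threshold, orthographically close but unrelated words such as "walk" and "talk" (sharing "alk") would also be linked, so the threshold needs tuning per language.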

Summary

Introduction

One significant barrier to progress in morphological analysis is the lack of available data for most of the world’s languages. Even for languages with resources suitable for computational morphological analysis, there is no guarantee that the available data covers all important aspects of the language, leading to significant error rates on unseen data. This uncertainty regarding training data makes unsupervised learning a natural modeling choice for the field of computational morphology. The task we tackle is to cluster surface word forms into groups that reflect the application of a morphological paradigm to a single lemma. The words in a single paradigm are inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also usually show a high degree of orthographical similarity. The final output of the system is a set of clusters, each one representing a morphological paradigm.
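The clustering setup described above can be illustrated with a small sketch of k-means over orthographic representations with pre-defined centroids. This is only an illustration under stated assumptions: character bigram counts stand in for the paper's character-based embeddings, the seed words are chosen by hand rather than derived from longest common substrings, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Count character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def sq_dist(u, v):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    return sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in set(u) | set(v))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: c / len(vectors) for k, c in total.items()}


def kmeans_with_seeds(words, seed_words, iterations=10):
    """K-means where each initial centroid is the vector of a seed word
    (assumed here to be supplied by hand)."""
    vectors = {w: char_ngrams(w) for w in words}
    centroids = [dict(char_ngrams(s)) for s in seed_words]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each word goes to its nearest centroid.
        new_assignment = {
            w: min(range(len(centroids)),
                   key=lambda c: sq_dist(vectors[w], centroids[c]))
            for w in words
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = mean_vector(members)
    clusters = [[] for _ in centroids]
    for w in words:
        clusters[assignment[w]].append(w)
    return clusters


result = kmeans_with_seeds(
    ["walk", "walks", "walked", "jump", "jumps", "jumped"],
    seed_words=["walk", "jump"],
)
```

Seeding the centroids makes the result deterministic, whereas random initialization can give different clusters from run to run. The number of seeds also fixes the number of clusters, which in the unsupervised setting is not known in advance.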

Previous Work
Task overview
For each word:
Word Representations
Clustering
Results
Error Analysis
Regular Verb Forms
Irregular Verb Forms
Character Distance Errors
Non-Affix Based Morphology
Word Length