Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering

E Margaret Perkoff, Alexis Palmer, Josh Daniels

doi:10.18653/v1/2021.sigmorphon-1.10

Abstract

This paper presents two systems for unsupervised clustering of morphological paradigms, in the context of the SIGMORPHON 2021 Shared Task 2. The goal of this task is to correctly cluster words in a given language by their inflectional paradigm, without any previous knowledge of the language and without supervision from labeled data of any sort. The words in a single morphological paradigm are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also usually show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Additionally, we discuss the merits of randomly initialized centroids versus pre-defined centroids for clustering. Pre-defined centroids are identified using either a standard longest common substring algorithm or a connected graph method built on longest common substring. For all development languages, the character-based embeddings perform similarly to the baseline, and the semantic embeddings perform well below the baseline. Analysis of the systems’ errors suggests that clustering based on orthographic representations is suitable for a wide range of morphological mechanisms, particularly as part of a larger system.
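To make the idea of KMeans clustering over orthographic representations with pre-defined centroids concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the character-bigram count vectors, the boundary markers `<` and `>`, the function names, and the choice of seed words as initial centroids are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Count character n-grams of a word (hypothetical orthographic representation)."""
    padded = f"<{word}>"  # boundary markers are an assumption of this sketch
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def squared_distance(u, v):
    """Squared Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans_with_seeds(words, seed_words, iterations=10):
    """Cluster words with k-means, starting from pre-defined (seed) centroids."""
    vectors = {w: char_ngrams(w) for w in words}
    centroids = [dict(char_ngrams(s)) for s in seed_words]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each word goes to its nearest centroid.
        new_assignment = {}
        for w in words:
            distances = [squared_distance(vectors[w], c) for c in centroids]
            new_assignment[w] = distances.index(min(distances))
        if new_assignment == assignment:
            break  # converged: no word changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for idx in range(len(centroids)):
            members = [vectors[w] for w in words if assignment[w] == idx]
            if members:
                centroids[idx] = mean_vector(members)
    clusters = [[] for _ in centroids]
    for w in words:
        clusters[assignment[w]].append(w)
    return clusters


words = ["walk", "walks", "walked", "jump", "jumps", "jumped"]
clusters = kmeans_with_seeds(words, seed_words=["walk", "jump"])
```

With the seed words `walk` and `jump`, the toy vocabulary separates into one cluster per lemma. Starting from random centroids instead would only change how `centroids` is initialized; the assignment and update steps stay the same.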

Highlights

One significant barrier to progress in morphological analysis is the lack of available data for most of the world’s languages.
The SIGMORPHON 2021 shared task aims to leverage the unsupervised setting to identify morphological paradigms, while including languages with a wide range of morphological properties.
During testing, the highly connected subgraphs (HCS) graph analysis proved computationally taxing and could not be completed in time for evaluation, though qualitative analysis of the generated longest common substring (LCS) graphs suggests the technique may still be useful given more computational power.
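The LCS graph idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the edge threshold `min_shared` and the function names are hypothetical, and plain connected components are used here as a much simpler stand-in for a true highly connected subgraphs (HCS) analysis, which is a different and more expensive algorithm.

```python
from collections import deque
from difflib import SequenceMatcher


def lcs_length(a, b):
    """Length of the longest common contiguous substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def build_lcs_graph(words, min_shared=4):
    """Link two words when they share a substring of at least min_shared characters.

    min_shared is a hypothetical threshold chosen for this toy example.
    """
    graph = {w: set() for w in words}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if lcs_length(u, v) >= min_shared:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def connected_components(graph):
    """Group nodes by breadth-first search (a stand-in for HCS, not HCS itself)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


words = ["walk", "walks", "walking", "jump", "jumps", "jumped"]
components = connected_components(build_lcs_graph(words))
```

Building the graph compares every pair of words, so the cost grows quadratically with vocabulary size even before any subgraph analysis is run. Note also that a pair such as `walked` and `talked` shares the five-character substring `alked`, so this threshold would link words from different paradigms.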

Summary

Introduction

One significant barrier to progress in morphological analysis is the lack of available data for most of the world’s languages. Even with resources suitable for computational morphological analysis, there is no guarantee that the available data covers all important aspects of the language, leading to significant error rates on unseen data. This uncertainty regarding training data makes unsupervised learning a natural modeling choice for the field of computational morphology. The task we tackle is to cluster surface word forms into groups that reflect the application of a morphological paradigm to a single lemma. The words in a single paradigm are inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also usually show a high degree of orthographical similarity. The final output of the system is a set of clusters, each one representing a morphological paradigm.
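The orthographic similarity between forms of one paradigm can be illustrated with a standard dynamic-programming longest common substring routine. The sketch below is a generic textbook version, not code from the paper, and the function name is an assumption.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match within a
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


regular = longest_common_substring("walking", "walked")
irregular = longest_common_substring("sang", "sing")
```

For the regular pair the shared material is the full stem `walk`, while the pair `sang` and `sing` shares only `ng`, which hints at why forms with stem-internal changes are harder to group by surface similarity alone.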

Previous Work

Task overview

Word Representations

Clustering

Results

Error Analysis

Regular Verb Forms

Irregular Verb Forms

Character Distance Errors

Non-Affix Based Morphology

Word Length


