Abstract

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus; these analyses are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic, or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
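
As a rough illustration of the generate-then-disambiguate pipeline described above, the Python sketch below pairs a toy closed-class affix grammar with a corpus-frequency heuristic. The affix lists, the scoring function, and all identifiers are hypothetical simplifications for exposition, not the paper's actual grammar or base features.

```python
from collections import Counter

# Toy closed-class affix grammar (Buckwalter-style transliteration of a few
# Arabic clitics). The real grammar is richer; these lists are illustrative
# assumptions only.
PREFIXES = ["", "w", "f", "b", "l", "Al", "wAl", "bAl"]
SUFFIXES = ["", "h", "hA", "hm", "k", "y", "nA"]

def analyses(word):
    """Overgenerate every (prefix, base, suffix) split the grammar licenses."""
    for pre in PREFIXES:
        for suf in SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                base = word[len(pre):len(word) - len(suf)]
                if base:  # require a non-empty linguistic base
                    yield (pre, base, suf)

def disambiguate(word, base_freq):
    """Choose the analysis whose proposed base is most frequent in the raw
    corpus, preferring fewer clitic characters on ties (a hypothetical
    stand-in for the paper's base-feature disambiguation)."""
    return max(analyses(word),
               key=lambda a: (base_freq[a[1]], -len(a[0] + a[2])))

# Usage: base statistics come straight from a raw, unsegmented corpus.
corpus = "ktb wktb Alktb ktbhA ktAb wAlktAb".split()
base_freq = Counter(b for w in corpus for _, b, _ in analyses(w))
print(disambiguate("wAlktAb", base_freq))  # -> ('wAl', 'ktAb', '')
```

Because the grammar is small and closed-class, extending coverage to new orthographic or dialectal variants amounts to adding a handful of entries to the affix lists.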

Highlights

  • Non-standard domains, dialectal variation, and unstandardized spelling make segmentation challenging, yet morphologically rich languages require good segmentation to enable downstream applications from syntactic parsing to machine translation (MT)

  • We present De-lexical Segmentation (DESEG), a slightly more expensive but more powerful alternative to language-agnostic morphological segmentation, realizing most of the benefits of supervised segmentation at a far lower cost

  • Using a corpus of several Arabic dialects exhibiting rich and complex morphology, unstandardized spelling, and variation bordering on mutual unintelligibility, we evaluate DESEG intrinsically on language modeling (LM) and extrinsically on MT

Summary

Introduction

Non-standard domains, dialectal variation, and unstandardized spelling make segmentation challenging, yet morphologically rich languages require good segmentation to enable downstream applications from syntactic parsing to machine translation (MT). Language-agnostic unsupervised options like MORFESSOR (Creutz and Lagus, 2005) and byte pair encoding (BPE) (Sennrich et al., 2016) assume no resources beyond raw text but can yield lower performance on downstream tasks (Vania and Lopez, 2017; Kann et al., 2018). They suffer from typological biases and favor intended applications at the expense of others. DESEG consistently outperforms MORFESSOR and BPE while only costing a few hours of grammar-building labor, and in some environments it outperforms the state-of-the-art supervised Arabic tokenizers MADAMIRA (Pasha et al., 2014) and FARASA (Abdelali et al., 2016). The success of such a simple model is strong evidence for the value of linguistic input during preprocessing. DESEG is publicly available at github.com/CAMeL-Lab/deSeg.
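
For contrast with the baselines just mentioned, here is a minimal sketch of the merge-learning loop behind BPE, in the spirit of Sennrich et al. (2016). It is a simplification for illustration, not the reference subword-nmt implementation, and the toy vocabulary is invented.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word -> frequency dict."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Usage: merges are driven purely by pair frequency, with no notion of
# morpheme boundaries -- one source of the typological bias noted above.
print(learn_bpe({"ktb": 5, "wktb": 2, "Alktb": 3}, num_merges=3))
```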

Related Work
De-lexical Segmentation for Arabic
Arabic and its Dialects
De-lexical Analysis
Unsupervised Disambiguation
Models
Intrinsic Language Modeling Evaluation
Extrinsic Machine Translation Evaluation
Error Analysis
Conclusion and Future Work
