Abstract

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus; these analyses are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic, or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
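
As a rough illustration of the generate-then-disambiguate pipeline described above, the Python sketch below pairs a toy closed-class affix grammar with a corpus-frequency heuristic. The affix lists, the scoring function, and all identifiers are hypothetical simplifications for exposition, not the paper's actual grammar or base features.

```python
from collections import Counter

# Toy closed-class affix grammar (Buckwalter-style transliteration of a few
# Arabic clitics). The real grammar is richer; these lists are illustrative
# assumptions only.
PREFIXES = ["", "w", "f", "b", "l", "Al", "wAl", "bAl"]
SUFFIXES = ["", "h", "hA", "hm", "k", "y", "nA"]

def analyses(word):
    """Overgenerate every (prefix, base, suffix) split the grammar licenses."""
    for pre in PREFIXES:
        for suf in SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                base = word[len(pre):len(word) - len(suf)]
                if base:  # require a non-empty linguistic base
                    yield (pre, base, suf)

def disambiguate(word, base_freq):
    """Choose the analysis whose proposed base is most frequent in the raw
    corpus, preferring fewer clitic characters on ties (a hypothetical
    stand-in for the paper's base-feature disambiguation)."""
    return max(analyses(word),
               key=lambda a: (base_freq[a[1]], -len(a[0] + a[2])))

# Usage: base statistics come straight from a raw, unsegmented corpus.
corpus = "ktb wktb Alktb ktbhA ktAb wAlktAb".split()
base_freq = Counter(b for w in corpus for _, b, _ in analyses(w))
print(disambiguate("wAlktAb", base_freq))  # -> ('wAl', 'ktAb', '')
```

Because the grammar is small and closed-class, extending coverage to new orthographic or dialectal variants amounts to adding a handful of entries to the affix lists.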

Highlights

  • Non-standard domains, dialectal variation, and unstandardized spelling make segmentation challenging, yet morphologically rich languages require good segmentation to enable downstream applications from syntactic parsing to machine translation (MT)

  • We present De-lexical Segmentation (DESEG), a slightly more expensive but more powerful alternative to language-agnostic morphological segmentation, realizing most of the benefits of supervised segmentation at a far lower cost

  • Using a corpus of several Arabic dialects exhibiting rich and complex morphology, unstandardized spelling, and variation bordering on mutual unintelligibility, we evaluate DESEG intrinsically on language modeling (LM) and extrinsically on MT

Summary

Introduction

Non-standard domains, dialectal variation, and unstandardized spelling make segmentation challenging, yet morphologically rich languages require good segmentation to enable downstream applications from syntactic parsing to machine translation (MT). Language-agnostic unsupervised options like MORFESSOR (Creutz and Lagus, 2005) and byte pair encoding (BPE) (Sennrich et al., 2016) assume no resources beyond raw text but can yield lower performance on downstream tasks (Vania and Lopez, 2017; Kann et al., 2018). They suffer from typological biases and favor intended applications at the expense of others. DESEG consistently outperforms MORFESSOR and BPE while only costing a few hours of grammar-building labor, and in some environments it outperforms the state-of-the-art supervised Arabic tokenizers MADAMIRA (Pasha et al., 2014) and FARASA (Abdelali et al., 2016). The success of such a simple model is strong evidence for the value of linguistic input during preprocessing. DESEG is publicly available at github.com/CAMeL-Lab/deSeg.
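
For contrast with the baselines just mentioned, here is a minimal sketch of the merge-learning loop behind BPE, in the spirit of Sennrich et al. (2016). It is a simplification for illustration, not the reference subword-nmt implementation, and the toy vocabulary is invented.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word -> frequency dict."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Usage: merges are driven purely by pair frequency, with no notion of
# morpheme boundaries -- one source of the typological bias noted above.
print(learn_bpe({"ktb": 5, "wktb": 2, "Alktb": 3}, num_merges=3))
```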

Related Work
De-lexical Segmentation for Arabic
Arabic and its Dialects
De-lexical Analysis
Unsupervised Disambiguation
Models
Intrinsic Language Modeling Evaluation
Extrinsic Machine Translation Evaluation
Error Analysis
Conclusion and Future Work
