Abstract
Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter, a fast word segmentation algorithm which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it on a small corpus of text in the critically endangered language of the Ainu people living in northern Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, the Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with that of the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
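To make this reduction concrete, the following minimal Python sketch covers an unsegmented string with as few lexicon n-grams as possible, using dynamic programming over character positions. The function name, the dictionary-based lexicon (space-free spans mapped to their segmented word sequences) and the max_span limit are illustrative assumptions of ours, not the authors' actual implementation.

    def mingmatch_like_segment(text, ngram_lexicon, max_span=30):
        # ngram_lexicon maps a space-free character span to the segmented
        # word n-gram it represents, e.g. {"spanwithoutspaces": "word1 word2", ...}
        n = len(text)
        best = [None] * (n + 1)   # best[j]: fewest n-grams covering text[:j]
        back = [None] * (n + 1)   # back[j]: (previous index, segmented n-gram)
        best[0] = 0
        for i in range(n):
            if best[i] is None:
                continue
            # try every span starting at i that matches a lexicon entry
            for j in range(i + 1, min(n, i + max_span) + 1):
                span = text[i:j]
                if span in ngram_lexicon and (best[j] is None or best[i] + 1 < best[j]):
                    best[j] = best[i] + 1
                    back[j] = (i, ngram_lexicon[span])
        if best[n] is None:
            return None           # the lexicon cannot cover the whole input
        # reconstruct the cover with the fewest n-grams
        out, i = [], n
        while i > 0:
            i, ngram = back[i]
            out.append(ngram)
        return " ".join(reversed(out))

Minimizing the number of lexicon entries used to cover the input corresponds to the "shortest sequence of lexical n-grams" criterion described above.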
Highlights
One way to handle ambiguity—a major challenge in any Natural Language Processing task—is to consider the target text in context
In this paper we argue that in the context of word segmentation, the problem can be reduced to finding the shortest sequence of n-grams matching the input text, with little or no drop in performance compared to state-of-the-art methodologies
One of the key components of the neural segmenter's methodology is its concatenated n-gram character representations, which offer a significant performance boost over conventional character embeddings, without resorting to external data sources. We used its implementation in the experiments described later in this paper, in order to verify how a character-based neural model performs under extremely low-resource conditions, such as those of the Ainu language, and how it compares with segmenters utilizing lexical n-grams, including ours; an illustrative sketch of such a representation follows
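The Python sketch below concatenates, for every character position, vectors for the character n-grams ending at that position, so the classifier labelling a position sees more context than a single character vector. The function name, the parameters and the randomly initialised stand-in embedding table are hypothetical; the cited model's actual architecture may differ.

    import numpy as np

    def char_ngram_features(text, max_n=3, dim=8, table=None, rng=None):
        # For every character position i, look up a vector for each n-gram
        # ending at i (n = 1 .. max_n) and concatenate them; a learned
        # embedding table is replaced here by random stand-in vectors.
        rng = rng if rng is not None else np.random.default_rng(0)
        table = table if table is not None else {}
        feats = []
        for i in range(len(text)):
            parts = []
            for n in range(1, max_n + 1):
                ngram = text[max(0, i - n + 1): i + 1]
                if ngram not in table:
                    table[ngram] = rng.standard_normal(dim)
                parts.append(table[ngram])
            feats.append(np.concatenate(parts))   # shape: (max_n * dim,)
        return np.stack(feats)                    # shape: (len(text), max_n * dim)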
Summary
One way to handle ambiguity—a major challenge in any Natural Language Processing task—is to consider the target text in context. A typical approach is to use an n-gram model, where the probability of a word depends on the n − 1 previous words. In this paper we argue that in the context of word segmentation, the problem can be reduced to finding the shortest sequence of n-grams matching the input text, with little or no drop in performance compared to state-of-the-art methodologies. The main contributions of this work include a fast n-gram model yielding results comparable to those of state-of-the-art systems in the task of word segmentation of the Ainu language.
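For reference, the n-gram assumption mentioned above is the standard Markov factorization of a word sequence (a textbook formulation rather than anything specific to this paper):

    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})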