Abstract
Word segmentation is a critical task in natural language processing for Southeast Asian abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity through specialized preprocessing requirements. This paper introduces UnifiedCut, a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut's multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCut achieves state-of-the-art performance in word segmentation.
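To make the architectural idea concrete, the following is a minimal, hypothetical sketch of per-character segmentation tagging in which each target character attends, via multi-head attention, over embeddings of several n-grams drawn from a fixed-length local window. All names, dimensions, the binary tag scheme, and the way higher-order n-gram features are formed (averaged character embeddings) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class WindowedNgramSegmenter(nn.Module):
    """Illustrative sketch: windowed multi-head attention over n-gram features."""

    def __init__(self, vocab_size, d_model=128, n_heads=4, window=3, max_n=3):
        super().__init__()
        self.window = window          # characters taken on each side of the target
        self.max_n = max_n            # n-gram orders used (1..max_n)
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)  # assumed tags: begin-of-word / inside-word

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character ids
        B, T = char_ids.shape
        pad = char_ids.new_zeros(B, self.window)
        padded = torch.cat([pad, char_ids, pad], dim=1)        # (B, T + 2*window)

        # For every position, collect context "tokens": unigram embeddings inside
        # the window plus averaged embeddings standing in for higher-order n-grams.
        emb = self.embed(padded)                               # (B, T + 2*window, d)
        windows = emb.unfold(1, 2 * self.window + 1, 1)        # (B, T, d, win)
        windows = windows.transpose(2, 3)                      # (B, T, win, d)

        ngram_tokens = [windows]                               # unigrams in the window
        for n in range(2, self.max_n + 1):
            # crude n-gram feature: mean of n consecutive character embeddings
            grams = windows.unfold(2, n, 1).mean(dim=-1)       # (B, T, win-n+1, d)
            ngram_tokens.append(grams)
        context = torch.cat(ngram_tokens, dim=2)               # (B, T, K, d)

        # The target character's embedding queries its local n-gram context.
        query = emb[:, self.window:self.window + T].reshape(B * T, 1, -1)
        kv = context.reshape(B * T, context.size(2), -1)
        attended, _ = self.attn(query, kv, kv)                 # (B*T, 1, d)
        return self.classifier(attended.reshape(B, T, -1))     # (B, T, 2) tag logits


if __name__ == "__main__":
    model = WindowedNgramSegmenter(vocab_size=1000)
    logits = model(torch.randint(1, 1000, (2, 20)))            # toy batch of 2 sequences
    print(logits.shape)                                        # torch.Size([2, 20, 2])
```

Because every position's attention over its window is independent of the others, all positions can be processed in parallel, which is the source of the scalability advantage the abstract contrasts with sequential RNN-based segmenters.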