Word segmentation is a critical task in natural language processing for Southeast Asian abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity because they require specialized preprocessing. This paper introduces UnifiedCut, a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut’s multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCut achieves state-of-the-art performance in word segmentation.