Research on the Uyghur morphological segmentation model with an attention mechanism

Kahaerjing Abiderexiti, Yunfei Shen, Gulinigeer Abudouwaili, Aishan Wumaier

doi:10.1080/09540091.2022.2134843

Open Access

Journal: Connection Science
Publication Date: Oct 21, 2022
Citations: 1
License type: open-access

Affiliation: Xinjiang University

Abstract

Morphological segmentation is a basic task in information processing for agglutinative languages: it divides words into morphemes, the smallest units of meaning. There are two types of morphological segmentation: canonical segmentation and surface segmentation. For Uyghur, a typical agglutinative language, canonical segmentation usually relies on statistical methods, which depend on manual feature extraction. In surface segmentation, neural networks avoid the manual feature extraction process. However, to date, no model can provide both segmentation results for Uyghur without adding features. In addition, morphological segmentation is usually treated as a sequence labelling task, so label imbalance easily occurs in datasets. To address these problems, this paper proposes an improved labelling scheme that combines morphological boundary labels and voice harmony labels so that both kinds of segmentation are produced simultaneously. A convolutional network and an attention mechanism are then added to capture local and global features, respectively. Finally, morphological segmentation is cast as a sequence labelling task over character sequences. To handle label imbalance and noise in the dataset, a focal loss function with label smoothing is used. The experimental results show that the model achieves the best F1 values for both canonical segmentation and surface segmentation.
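
The abstract names a focal loss function with label smoothing but does not give its formula. The sketch below shows one common way to combine the two for a single character position. It is an illustration under stated assumptions, not the authors' implementation. The function names, the parameters gamma and epsilon, and the choice to smooth the target uniformly over all classes are assumptions made here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss_with_label_smoothing(logits, target, gamma=2.0, epsilon=0.1):
    """Loss for one character position (hypothetical formulation).

    logits  : raw scores, one per label
    target  : index of the gold label
    gamma   : focusing parameter; 0 disables the focal term (assumed name)
    epsilon : smoothing mass spread uniformly over labels (assumed name)
    """
    num_classes = len(logits)
    probs = softmax(logits)
    loss = 0.0
    for k, p in enumerate(probs):
        # Smoothed target: most mass on the gold label, the rest shared.
        q = epsilon / num_classes + (1.0 - epsilon if k == target else 0.0)
        p = max(p, 1e-12)  # guard against log(0)
        # The focal factor (1 - p)^gamma down-weights confident predictions.
        loss -= q * (1.0 - p) ** gamma * math.log(p)
    return loss
```

With gamma = 0 and epsilon = 0 this reduces to ordinary cross-entropy. Raising gamma shrinks the contribution of positions the model already labels confidently, which is the usual motivation for focal loss on imbalanced label sets.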
