Abstract

Morphological segmentation is a basic task in agglutinative language information processing, dividing words into the smallest semantic unit morphemes. There are two types of morphological segmentation: canonical segmentation and surface segmentation. As a typical agglutinative language, Uyghur usually uses statistical-based methods in canonical segmentation, which relies on the artificial extraction of features. In surface segmentation, the artificial feature extraction process is avoided by using the neural network. However, to date, no model can provide both segmentation results in Uyghur without adding features. In addition, morphological segmentation is usually regarded as a sequence annotation task, so label imbalance easily occurs in datasets. Given the above situation, this paper proposes an improved labelling scheme that joins morphological boundary labels and voice harmony labels for the two kinds of segmentation simultaneously. Then, a convolution network and attention mechanism are added to capture local and global features, respectively. Finally, morphological segmentation is regarded as a sequence labeling task of character sequences. Due to the problem of label proportion imbalance and noise in the dataset, a focal loss function with label smoothing is used. The experimental results show that the F1 values of canonical segmentation and surface segmentation achieve the best results.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call