Abstract

In this study, we developed an Isarn Dharma word segmentation system. We mainly focused on solving the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous Isarn Dharma words occur frequently in word construction due to the writing style without tone markers. Thus, words can be interpreted as having different tones and meanings in the same writing text. To overcome these problems, we developed an Isarn Dharma character cluster–(IDCC) based statistical model and affixation and integrated it with the named entity recognition method (IDCC-C-based statistical model and affixation with named entity recognition (NER)). This method integrates the IDCC-based and character-based statistical models to distinguish the word boundaries. The IDCC-based statistical model utilizes the IDCC feature to disambiguate any ambiguous words. The unknown words are handled using the character-based statistical model, based on the character features. In addition, linguistic knowledge is employed to detect the boundaries of a new word based on the construction morphology and NER. In evaluations, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods when the corpus size increased. Using the test set, the proposed method obtained the best F-measure of 92.19, an F-measure that was better than the IDCC longest matching grouping at 2.85.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call