Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition

Sittichai Somsap, Pusadee Seresangtakul

doi:10.1145/3359990

Abstract

In this study, we developed an Isarn Dharma word segmentation system, focusing on the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous words occur frequently because the script is written without tone markers, so the same written form can be read with different tones and meanings. To overcome these problems, we developed a statistical model based on Isarn Dharma character clusters (IDCC) and affixation, and integrated it with named entity recognition (NER); we refer to the combined method as the IDCC-C-based statistical model and affixation with NER. The method integrates IDCC-based and character-based statistical models to identify word boundaries. The IDCC-based model uses IDCC features to disambiguate ambiguous words, while the character-based model uses character features to handle unknown words. In addition, linguistic knowledge is employed to detect the boundaries of new words based on construction morphology and NER. In the evaluation, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods as the corpus size increased. On the test set, the proposed method obtained the best F-measure, 92.19, which was 2.85 higher than that of IDCC longest matching grouping.
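The abstract reports results as an F-measure but does not spell out the scoring protocol. As a minimal sketch, assuming the common word-level definition (a predicted word counts as correct only when both of its boundaries match the reference), the score could be computed as follows. The function names `word_spans` and `segmentation_f_measure` and the toy Latin-script input are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_measure(
    gold: List[str], predicted: List[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over exactly matching word spans.

    Assumption: both segmentations cover the same underlying character string.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")

    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)

    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example with placeholder Latin characters standing in for script text.
gold = ["ab", "c", "de"]
predicted = ["ab", "cd", "e"]
precision, recall, f_measure = segmentation_f_measure(gold, predicted)
```

In this toy case only the first word has both boundaries right, so precision, recall, and F-measure each come out to one third. A reported value such as 92.19 would correspond to this ratio expressed on a 0 to 100 scale, if the paper follows the same convention.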


Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Feb 22, 2020
Citations: 7

