Thai Words Segmentation Using an Unsupervised Learning Technique

Jirapon Sunkpho, Markus Hofmann

doi:10.1007/978-3-030-44044-2_9

Abstract

Word segmentation, or tokenization, is the process of determining the most likely sequence of words in a sequence of text. For the Thai language, word segmentation is not a trivial task because words and sentences in Thai are written continuously, without spaces or delimiters. Most word segmentation techniques, especially those based on machine learning, require a training dataset that has been manually tagged to mark where words begin and end. In this study, an unsupervised machine learning technique that does not require manually tagged data was developed. The technique breaks the input text into syllables and then uses Genetic Algorithms (GA) to merge the syllables back into words. The GA identifies the best segmentation by minimizing word distance, a novel concept developed in this study. The word distance is the sum of the syllable distances over every pair of syllables within a word, and the syllable distance is a measure of how far apart a pair of syllables is in a document. The implementation was done in Python and achieves an F1 score of 70% using a training dataset of 100k untagged words. Performance improves further with more training data and some tuning of the GA parameters.
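The word distance described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formula for syllable distance, so the definition used here (the smallest positional gap between any occurrences of the two syllables in a syllable-split corpus) is an assumption made purely for illustration, as is the penalty for unseen syllables. The function names and the romanized placeholder syllables are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def syllable_distance(a, b, corpus):
    """Distance between two syllables in a syllable-split corpus.

    ASSUMPTION: defined here as the smallest gap between the positions of
    any occurrence of `a` and any occurrence of `b`. The paper's actual
    definition may differ.
    """
    pos_a = [i for i, s in enumerate(corpus) if s == a]
    pos_b = [i for i, s in enumerate(corpus) if s == b]
    if not pos_a or not pos_b:
        # ASSUMPTION: a syllable never seen in the corpus gets the
        # largest possible penalty, the corpus length.
        return float(len(corpus))
    return float(min(abs(i - j) for i in pos_a for j in pos_b))


def word_distance(word, corpus):
    """Sum of syllable distances over every pair of syllables in a word."""
    return sum(syllable_distance(a, b, corpus) for a, b in combinations(word, 2))


# Hypothetical training text, already split into syllables.
corpus = ["ka", "ra", "ok", "ma", "ka", "ra", "na"]

# Candidate words are lists of syllables.
close_pair = word_distance(["ka", "ra"], corpus)   # syllables that occur side by side
far_pair = word_distance(["ok", "na"], corpus)     # syllables that occur far apart
```

Under this assumed definition, syllables that tend to occur next to each other yield a small word distance, so a candidate word built from them would be favoured by a search that minimizes the total. A single-syllable word has no pairs and therefore scores 0 in this sketch.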

Full Text