Thai Word Segmentation with a Brain-Inspired Sparse Distributed Representations Learning Memory.

Thasayu Soisoonthorn,Maleerat Maliyaem,Herwig Unger

doi:10.1155/2023/8592214

Abstract

Word segmentation is necessary for many natural language processing, especially Thai language, that is, unsegmented words. However, wrong segmentation causes terrible performance in the final result. In this study, we propose two new brain-inspired methods based on Hawkins' approach to address Thai word segmentation. Sparse Distributed Representations (SDRs) are used to model the neocortex structure of the brain to store and transfer information. The first proposed method, THDICTSDR, improves the dictionary-based approach by utilizing SDRs to learn the surrounding context and combine with n-gram to select the correct word. The second method uses SDRs instead of a dictionary and is called THSDR. The evaluation uses the BEST2010 and LST20 standard datasets for segmentation words by comparing them with the longest matching, newmm, and Deepcut, which is state-of-the-art in the deep learning approach. The result shows that the first method provides the accuracy, and performances are significantly better than other dictionary bases. The first new method can achieve F1-Score at 95.60%, comparable to the state-of-the-art and Deepcut F1-Score at 96.34%. However, it provides a better performance F1-Score at 96.78% in learning all vocabularies. In addition, it can achieve 99.48% F1-Score beyond Deepcut 97.65% in case of all sentences being learnt. The second method has fault tolerance to noise and provides overall result over deep learning in all cases.

Full Text