Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations

Chanatip Saetia, Tawunrat Chalothorn, Ekapol Chuangsuwanich, Peerapon Vateekul

doi:10.4186/ej.2021.25.6.15

Abstract

A sentence is typically treated as the minimal syntactic unit used for extracting valuable information from a longer piece of text. However, written Thai has no explicit sentence markers. We propose a deep learning model for sentence segmentation that makes three main contributions. First, we integrate n-gram embeddings as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, because labeled data are scarce and annotation is difficult and time-consuming, we investigate and adapt Cross-View Training (CVT) as a semi-supervised learning technique, which allows us to use unlabeled data to improve the model representations. In the Thai sentence segmentation experiments, our model reduced the relative error by 7.4% and 10.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. We also applied our model to the task of punctuation restoration on the IWSLT English dataset, where it outperformed prior sequence tagging models with a relative error reduction of 2.5%. Ablation studies revealed that the n-gram representations were the main contributing factor for Thai, while the semi-supervised training helped the most for English.
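The relative error reductions quoted in the abstract can be read as the fraction of the baseline's error that the new model removes. The sketch below is illustrative only: it assumes the error is defined as one minus a score in [0, 1] (the abstract does not say which metric is used), the function name `relative_error_reduction` is hypothetical, and the example scores are made up rather than taken from the paper.

```python
def relative_error_reduction(baseline_score: float, model_score: float) -> float:
    """Return the fraction of the baseline's error removed by the model.

    Assumption: scores lie in [0, 1] and error is defined as 1 - score.
    """
    baseline_error = 1.0 - baseline_score
    model_error = 1.0 - model_score
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_error - model_error) / baseline_error


# Hypothetical scores chosen only for illustration; not results from the paper.
example = relative_error_reduction(baseline_score=0.90, model_score=0.92)
print(f"relative error reduction: {example:.1%}")
```

Under this reading, a small absolute gain on an already strong baseline can still correspond to a sizeable relative reduction, which is why the measure is common for sequence tagging comparisons.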

Journal: Engineering Journal
Publication Date: Jun 30, 2021
Citations: 1

Similar Papers

Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations
Chanatip Saetia ... Ekapol Chuangsuwanich
ACM Transactions on Asian and Low-Resource Language Information Processing
07 May 2020

Effective semi-supervised learning strategies for automatic sentence segmentation
Dogan Dalva ... Hakan Gurkan
Pattern Recognition Letters | VOL. 105
10 Oct 2017

An adaptable sentence segmentation based on Indonesian rules
Johannes Petrus ... Sukemi Sukemi
IAES International Journal of Artificial Intelligence (IJ-AI) | VOL. 12
01 Sep 2023

Automatic Sentence Segmentation of Speech for Automatic Summarization
J Mrozinski ... P Chatain
14 May 2006
