Abstract

Text correction systems (e.g., spell checkers) have been used to improve the quality of computerized text by detecting and correcting errors. However, the task of performing spelling correction and word normalization (text correction) for Thai social media text has remained largely unexplored. In this paper, we investigate how current text correction systems perform at correcting errors and word variances in Thai social media text and propose a method designed for this task. We found that currently available Thai text correction systems are insufficiently robust for correcting spelling errors and word variances, while text correctors designed for English grammatical error correction suffer from overcorrections (text rewrites). Thus, we propose a neural-based text corrector with a two-stage structure that alleviates overcorrection while exploiting the benefits of a neural Seq2Seq corrector. Our method consists of a neural-based error detector and a Seq2Seq neural error corrector with contextual attention. This novel architecture allows the Seq2Seq network to produce corrections based on both the erroneous text and its context without the need for an end-to-end structure. Our method outperformed all the other evaluated text correction systems. Compared to the second-best result (copy-augmented transformer), our method further reduced the word error rate (WER) from 2.51% to 2.07%, improved the generalized language evaluation understanding (GLEU) score from 0.9409 to 0.9502 on the Thai text correction task, and improved the GLEU score from 0.7409 to 0.7539 on the English spelling correction task.
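For background on the metric reported above: WER is the word-level edit (Levenshtein) distance between a system's output and the reference text, normalized by the reference length. The following is a minimal illustrative implementation, not the paper's evaluation code; the function name and example sentences are ours.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 edits over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```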

Highlights

  • The fast and widespread adoption of social media as a means of communication has led to an explosive increase in user-generated text data on the Internet

  • We examined a variety of text correction techniques, ranging from dictionary-based (i.e., Hunspell [9]) and statistically based methods (i.e., PyThaiNLP [10]) to modern systems featuring sequence-to-sequence neural networks employed in state-of-the-art English grammatical error correction (GEC) systems (i.e., bidirectional GRU (Bi-GRU) Seq2Seq [11], Copy-Augmented Transformer [12])

  • The TC approaches were evaluated on three tasks: our TC task on Thai user-generated web content (UGWC) and two TC tasks derived from the English CoNLL-2014 shared task [21]
Summary

Introduction

The fast and widespread adoption of social media as a means of communication has led to an explosive increase in user-generated text data on the Internet. Natural language processing (NLP) techniques are often used to keep up with the pace of this rapidly growing data and to introduce new applications such as real-time disease surveillance [1] and monitoring public perceptions of brands, products, and services (social listening). Social text introduces challenges not previously found in traditional written media (e.g., news, published articles), such as a wide variety of language usage from users with varying levels of language proficiency. Moreover, because the large quantity of social media text is entered through other interfaces (e.g., physical and virtual keyboards), it does not exhibit the same types of errors as text produced by OCR systems. Finally, the text correction systems developed and employed in free open source software (FOSS) have yet to be evaluated on social texts.