Augmented-syllabification of n-gram tagger for Indonesian words and named-entities

Suyanto Suyanto, Andi Sunyoto, Rezza Nafi Ismail, Ade Romadhony, Febryanti Sthevanie

doi:10.1016/j.heliyon.2022.e11922

Abstract

As a statistical model, n-gram syllabification commonly gives a high syllable error rate (SER) for Bahasa Indonesia, a low-resource language, because it fails when the out-of-vocabulary (OOV) rate is high. Two previous models, bigram syllabification with flipping onsets (BFO) and a combination of bigram with backoff smoothing based on phonological similarity (CBSPS), use augmentation methods to reduce the OOV rate. However, both BFO and CBSPS have two problems. First, they apply the n-gram at the syllable level instead of the grapheme level, so they suffer from n-gram sparsity. Second, they rely on a procedure that detects the positions of vowels and diphthongs. Both problems leave them unable to distinguish diphthongs from derivative words or to syllabify named-entities, which have many ambiguities related to vowels and semi-vowels. In this paper, a syllabification model based on an n-gram tagger, which is applied at the grapheme level and does not rely on vowel or diphthong detection, is developed to solve both problems. In addition, three data augmentation methods are exploited to enrich the dataset. Five-fold cross-validation (5-FCV) on two datasets, one of 50 k words and one of 15 k named-entities, shows that the proposed augmented-syllabification of n-gram tagger (ASnGT) model is significantly better than both BFO and CBSPS. It is also significantly better than the fuzzy k-nearest neighbor in every class (FkNNC)-based model for formal words and named-entities. However, it struggles with derivative words, which it cannot easily distinguish from absorption words and foreign-language terms. It also struggles with some foreign named-entities.
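To make the idea of grapheme-level tagging concrete, here is a minimal Python sketch. It is not the authors' ASnGT implementation: the B/I tag scheme, the symmetric context window, the backoff order, the toy corpus, and all function names are assumptions chosen for illustration. Each grapheme is tagged `B` (begins a syllable) or `I` (inside a syllable), and the tag is predicted from the widest surrounding grapheme window seen in training, with no vowel or diphthong detection step.

```python
from collections import Counter, defaultdict


def to_tags(syllables):
    """Tag each grapheme: 'B' if it begins a syllable, 'I' otherwise."""
    tags = []
    for syl in syllables:
        tags.extend(["B"] + ["I"] * (len(syl) - 1))
    return tags


def contexts(word, i, max_n):
    """Yield (n, window) pairs around position i, widest window first."""
    padded = "#" * max_n + word + "#" * max_n  # '#' pads the word edges
    centre = i + max_n
    for n in range(max_n, -1, -1):
        yield n, padded[centre - n : centre + n + 1]


def train(corpus, max_n=2):
    """Count tags per grapheme window. corpus: list of syllable lists."""
    counts = defaultdict(Counter)
    for syllables in corpus:
        word = "".join(syllables)
        for i, tag in enumerate(to_tags(syllables)):
            for n, ctx in contexts(word, i, max_n):
                counts[(n, ctx)][tag] += 1
    return counts


def syllabify(word, counts, max_n=2):
    """Predict a tag per grapheme, backing off to narrower windows."""
    if not word:
        return []
    syllables = []
    for i, ch in enumerate(word):
        tag = "I"  # default when no window was seen (assumption)
        for n, ctx in contexts(word, i, max_n):
            seen = counts.get((n, ctx))
            if seen:
                tag = seen.most_common(1)[0][0]
                break
        if i == 0 or tag == "B":  # the first grapheme always opens a syllable
            syllables.append(ch)
        else:
            syllables[-1] += ch
    return syllables


# Hypothetical toy corpus of pre-syllabified words, for illustration only.
corpus = [["ma", "kan"], ["mi", "num"], ["ba", "ca"]]
model = train(corpus)
print(syllabify("makan", model))
```

Because the windows are built from graphemes rather than whole syllables, an unseen word can still be tagged by backing off to narrower windows, which is the kind of sparsity relief the abstract attributes to working at the grapheme level.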

Similar Papers

Phonological similarity-based backoff smoothing to boost a bigram syllable boundary detection
Suyanto Suyanto
International Journal of Speech Technology | VOL. 23
25 Jan 2020

Indonesian graphemic syllabification using a nearest neighbour classifier and recovery procedure
Edwina Anky Parande ... Suyanto Suyanto
International Journal of Speech Technology | VOL. 22
08 Nov 2018

Optimizing Data Augmentation for Semantic Segmentation on Small-Scale Dataset
Rui Ma ... Pin Tao
15 Jun 2019

A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation
Azal Ahmad Khan ... Rohitash Chandra
Expert Systems with Applications | VOL. 244
10 Dec 2023

Journal: Heliyon
Publication Date: Nov 1, 2022
Citations: 1
