Abstract
As a statistical model, n-gram syllabification commonly gives a high syllable error rate (SER) for Bahasa Indonesia, a low-resource language, because it fails when the out-of-vocabulary (OOV) rate is high. Two previous models, bigram-syllabification with flipping onsets (BFO) and a combination of bigram with backoff smoothing based on phonological similarity (CBSPS), use augmentation methods to reduce the OOV rate. However, both BFO and CBSPS have two problems. First, they apply the n-gram at the syllable level instead of the grapheme level, so they suffer from n-gram sparsity. Second, they rely on a procedure to detect the positions of vowels and diphthongs. Both problems make them incapable of distinguishing diphthongs from derivative words and of syllabifying named entities, which have many ambiguities related to vowels and semi-vowels. In this paper, a syllabification model based on an n-gram tagger, which is applied at the grapheme level and does not rely on vowel or diphthong detection, is developed to solve both problems. In addition, three data augmentation methods are exploited to enrich the dataset. Five-fold cross-validation (5-FCV) on two datasets, one of 50 k words and one of 15 k named entities, shows that the proposed augmented-syllabification of n-gram tagger (ASnGT) model is significantly better than both BFO and CBSPS. It is also significantly better than the fuzzy k-nearest neighbor in every class (FkNNC)-based model for formal words and named entities. However, it struggles with derivative words, which it cannot easily distinguish from absorption words and foreign-language terms. It also fails on some foreign named entities.
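To make the grapheme-level framing concrete, the following is a minimal sketch of syllabification cast as tagging. It is not the ASnGT model itself. The tag scheme ("1" when a syllable boundary follows a grapheme, "0" otherwise), the context windows, the backoff order, and the names `to_tags`, `contexts`, `train`, and `syllabify` are all assumptions made for illustration, and the three training words are toy data.

```python
from collections import Counter, defaultdict


def to_tags(syllabified, sep="-"):
    """Turn 'ma-kan' into (grapheme, tag) pairs.

    Assumed tag scheme: '1' if a syllable boundary follows the grapheme,
    '0' otherwise. The last grapheme of the word is always '0'.
    """
    pairs = []
    syllables = syllabified.split(sep)
    for i, syl in enumerate(syllables):
        for j, g in enumerate(syl):
            boundary = j == len(syl) - 1 and i < len(syllables) - 1
            pairs.append((g, "1" if boundary else "0"))
    return pairs


def contexts(word, i):
    """Context keys for position i, from most to least specific (assumed)."""
    prev = word[i - 1] if i > 0 else "^"
    nxt = word[i + 1] if i + 1 < len(word) else "$"
    return [(prev, word[i], nxt), (word[i], nxt), (word[i],)]


def train(examples):
    """Count how often each grapheme context carries each tag."""
    counts = defaultdict(Counter)
    for syllabified in examples:
        pairs = to_tags(syllabified)
        word = "".join(g for g, _ in pairs)
        for i, (_, tag) in enumerate(pairs):
            for ctx in contexts(word, i):
                counts[ctx][tag] += 1
    return counts


def syllabify(word, counts, sep="-"):
    """Tag each grapheme with the most frequent tag of the first known context."""
    out = []
    for i, g in enumerate(word):
        tag = "0"  # default when no context was seen in training
        for ctx in contexts(word, i):
            if ctx in counts:
                tag = counts[ctx].most_common(1)[0][0]
                break
        out.append(g)
        if tag == "1" and i < len(word) - 1:
            out.append(sep)
    return "".join(out)


model = train(["ma-kan", "ba-tu", "ma-ta"])  # toy training data
print(syllabify("bata", model))  # prints "ba-ta"
```

Because the contexts are built from graphemes rather than whole syllables, the unseen word "bata" is still covered by contexts observed in "ba-tu" and "ma-ta", and no vowel or diphthong detection step is needed. This is the kind of sparsity reduction the grapheme-level approach aims for.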