Abstract

Two-year-old children who are learning to speak often pronounce a polysyllabic word with the onsets of consecutive syllables flipped. The result can be hard to understand, because the flipped onsets sometimes produce another word with a very different meaning. For instance, flipping the two onsets in the English word “me.lon” (a large round fruit of a plant of the gourd family) produces “le.mon” (an acid fruit). In Bahasa Indonesia, such cases are quite common. For example, the two onsets in “ba.tu” (stone) are swapped to give “ta.bu” (taboo), those in “be.sar” (big) to give “se.bar” (spread), and those in “ru.mah” (house) to give “mu.rah” (cheap). A preliminary study on 50k Indonesian formal words shows that the ratio between the frequencies of the flipped-onset bigrams and those of the 50 most frequent original syllable bigrams is quite high, up to 13.09%. This research investigates adopting this phenomenon to enhance a bigram orthographic syllabification model, which commonly performs poorly on out-of-vocabulary words. A five-fold cross-validation on 50k Indonesian formal words shows that flipping onsets enhances the bigram orthographic syllabification: the syllable error rate (SER) is reduced by 18.02% in relative terms. The method can also produce a quite low SER when trained on a tiny train set of 1k words and generalizing to 10k unseen words. In addition, it can be easily generalized to other languages, as well as to named entities, using a small amount of language-specific knowledge about the sets of vowels, diphthongs, and consonants.
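The onset flipping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_onset`, `flip_onsets`, `flipped_bigrams`) and the five-letter vowel set are assumptions made for illustration, and an onset is taken here to be simply the run of consonant letters before the first vowel of a syllable. How the flipped bigrams are then fed into the bigram model is not specified in the abstract and is left out.

```python
# Assumed vowel set for illustration; diphthongs and digraphs are not handled.
VOWELS = set("aeiou")


def split_onset(syllable):
    """Split a syllable into (onset, rest).

    The onset is taken to be the leading run of consonant letters;
    the rest starts at the first vowel.
    """
    for i, ch in enumerate(syllable):
        if ch in VOWELS:
            return syllable[:i], syllable[i:]
    # No vowel found: treat the whole string as onset.
    return syllable, ""


def flip_onsets(first, second):
    """Swap the onsets of two consecutive syllables."""
    onset1, rest1 = split_onset(first)
    onset2, rest2 = split_onset(second)
    return onset2 + rest1, onset1 + rest2


def flipped_bigrams(syllables):
    """Return the flipped-onset version of every consecutive syllable pair."""
    return [flip_onsets(a, b) for a, b in zip(syllables, syllables[1:])]


if __name__ == "__main__":
    # Examples taken from the abstract.
    for word in ("me.lon", "ba.tu", "be.sar", "ru.mah"):
        first, second = word.split(".")
        print(word, "->", ".".join(flip_onsets(first, second)))
```

Applied to the syllabified words of a training set, `flipped_bigrams` would yield additional syllable bigrams such as ("ta", "bu") from "ba.tu"; using them to augment the bigram statistics is one plausible reading of the approach, stated here as an assumption.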
