Abstract

Usage of code-mixed text has increased in recent years among Indonesian internet users, who often mix Indonesian-language with English-language text. Normalisation of this code-mixed text into Indonesian needs to be performed to capture the meaning of English parts of the text and process them effectively. We improve a state-of-the-art code-mixed Indonesian-English normalisation system by modifying its pipeline modules. We further analyse the effect of code-mixed normalisation on emotion classification tasks. Our approach significantly improved on a state-of-the-art Indonesian-English code-mixed text normalisation system in both the individual pipeline modules and the overall system. The new feature set in the language identification module showed an improvement of 4.26% in terms of F1 score. The combination of machine translation and ruleset in the lexical normalisation module improved BLEU score by 25.22% and lowered WER by 62.49%. The use of context in the translation module improved BLEU score by 2.5% and lowered WER by 8.84%. The effectiveness of the overall pipeline normalisation system increased by 32.11% and 33.82%, in terms of BLEU score and WER, respectively. Code-mixed normalisation also improved the accuracy of emotion classification by up to 37.74% in terms of F1 score.
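
To make the WER figures above concrete, the following is a minimal sketch of word error rate, assuming the standard definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the example sentences are hypothetical illustrations, not the paper's evaluation code or data, and the paper's exact tokenisation may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenisation; this is an illustrative choice,
    not necessarily the one used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one English word left untranslated in the output.
print(word_error_rate("aku suka film ini", "aku suka movie ini"))  # 0.25
```

A lower WER means the normalised output is closer to the reference, which is why the abstract reports WER as being lowered while BLEU is improved.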

Highlights

  • One common form of the phenomenon of multilingualism is code-mixing

  • The model created in this research achieved the highest score, outperforming the conditional random field (CRF) model used by Barik et al. by 3.33% on precision, 4.92% on recall, 4.26% on F1 score, and 4.01% on accuracy

  • Another problem in Barik et al.’s model is that the model was unable to properly normalise slang words that were very different from their formal versions


Summary

Introduction

Code-mixing is a linguistic phenomenon that mixes two or more language variations in one utterance [1]. This phenomenon can be found in various contexts, including social media [2], news articles [3], lectures [4], and even sermons [5]. A 2015 report noted that 57.5% of Indonesian people are bilingual and 17.4% are trilingual, among whom the most popular language combination is Indonesian, English and Javanese. Multilingualistic phenomena such as code-mixing have recently become increasingly common due to more widespread usage of the internet, especially social media.

