Abstract
Usage of code-mixed text has increased in recent years among Indonesian internet users, who often mix Indonesian-language with English-language text. Normalisation of this code-mixed text into Indonesian needs to be performed to capture the meaning of English parts of the text and process them effectively. We improve a state-of-the-art code-mixed Indonesian-English normalisation system by modifying its pipeline modules. We further analyse the effect of code-mixed normalisation on emotion classification tasks. Our approach significantly improved on a state-of-the-art Indonesian-English code-mixed text normalisation system in both the individual pipeline modules and the overall system. The new feature set in the language identification module showed an improvement of 4.26% in terms of F1 score. The combination of machine translation and ruleset in the lexical normalisation module improved BLEU score by 25.22% and lowered WER by 62.49%. The use of context in the translation module improved BLEU score by 2.5% and lowered WER by 8.84%. The effectiveness of the overall pipeline normalisation system increased by 32.11% and 33.82%, in terms of BLEU score and WER, respectively. Code-mixed normalisation also improved the accuracy of emotion classification by up to 37.74% in terms of F1 score.
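Since several of the results above are reported as word error rate (WER), a minimal sketch of the standard WER definition may help: the word-level edit distance between a reference and a system output, divided by the number of reference words. This is a generic illustration, not the evaluation code used in the paper; the function name `word_error_rate`, the whitespace tokenisation, and the example sentences are assumptions made here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by plain whitespace splitting, which
    may differ from the tokenisation used in the paper's evaluation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences, not taken from the paper's data.
print(word_error_rate("saya suka kopi", "saya suka teh"))
```

A lower WER means the normalised output is closer to the reference, which is why the abstract describes WER as being lowered while BLEU is improved.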
Highlights
One common form of the phenomenon of multilingualism is code-mixing
The model created in this research achieved the highest score, outperforming the conditional random field (CRF) model used by Barik et al. by 3.33% in precision, 4.92% in recall, 4.26% in F1 score, and 4.01% in accuracy
Another problem in Barik et al.’s model is that the model was unable to properly normalise slang words that were very different from their formal versions
Summary
Code-mixing is a linguistic phenomenon that mixes two or more language variations in one utterance [1]. This phenomenon can be found in various contexts, including social media [2], news articles [3], lectures [4], and even sermons [5]. A 2015 report noted that 57.5% of Indonesian people are bilingual and 17.4% are trilingual, among whom the most popular language combination is Indonesian, English and Javanese. Multilingual phenomena such as code-mixing have recently become increasingly common due to more widespread usage of the internet, especially social media.