Abstract

Code-mixed sentences are common on social media platforms, especially in countries such as Malaysia where more than two languages are spoken. Although multilingual Bidirectional Encoder Representations from Transformers (mBERT) can handle multiple languages, the sentence embeddings it produces for a code-mixed sentence can be highly complex. This poses a challenge for natural language processing of informal social media text, particularly for mixed language pairs such as Malay-English, for which training data is scarce. This paper therefore proposes a language threshold for translating the affected words, or the whole sentence, into a single language and relabeling the sentence's language accordingly. The results show an 8% increase in accuracy when affected words in a sentence are translated at the 60% language threshold using SEC PCA-200.
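
As a rough illustration of the thresholding step described above, the Python sketch below labels each token's language and, once the dominant language covers at least 60% of the tokens, translates the remaining ("affected") words into that language and relabels the sentence. Everything here (detect_word_language, translate, the toy lexicons) is a hypothetical stand-in under stated assumptions; the paper's actual detector, translator, and the SEC PCA-200 embedding setting are not reproduced.

```python
# Minimal sketch of the language-threshold idea, assuming a 60% threshold.
# detect_word_language(), translate(), and the toy lexicons below are
# hypothetical stand-ins, NOT the paper's actual components.

THRESHOLD = 0.60  # the 60% language threshold reported in the abstract

# Toy Malay lexicon standing in for a real word-level language detector.
MALAY_WORDS = {"saya", "sangat", "suka", "makan", "ini"}
MS_TO_EN = {"saya": "i", "sangat": "very", "suka": "like",
            "makan": "eat", "ini": "this"}
EN_TO_MS = {v: k for k, v in MS_TO_EN.items()}


def detect_word_language(word: str) -> str:
    """Label a token as Malay ('ms') or English ('en') via the toy lexicon."""
    return "ms" if word.lower() in MALAY_WORDS else "en"


def translate(word: str, target_lang: str) -> str:
    """Toy word-level translator; a real system would call an MT model."""
    table = EN_TO_MS if target_lang == "ms" else MS_TO_EN
    return table.get(word.lower(), word)


def normalize_code_mixed(sentence: str, threshold: float = THRESHOLD):
    """If the dominant language's share of tokens meets the threshold,
    translate the minority-language ("affected") words into it and
    relabel the sentence; otherwise leave the sentence code-mixed."""
    words = sentence.split()
    langs = [detect_word_language(w) for w in words]
    dominant = max(set(langs), key=langs.count)
    if langs.count(dominant) / len(words) >= threshold:
        fixed = [w if lang == dominant else translate(w, dominant)
                 for w, lang in zip(words, langs)]
        return " ".join(fixed), dominant
    return sentence, "mixed"


print(normalize_code_mixed("saya sangat suka makan this pizza"))
# Malay dominates 4/6 tokens (~67% >= 60%), so "this" is translated:
# ('saya sangat suka makan ini pizza', 'ms')
```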
