Abstract

Code-mixed sentences are common on social media platforms, especially in countries such as Malaysia where more than two languages are spoken. Although multilingual Bidirectional Encoder Representations from Transformers (mBERT) can handle multiple languages, the sentence embeddings it produces for a code-mixed sentence can be highly complex. This poses a challenge for natural language processing of informal social media text, particularly for mixed language pairs such as Malay-English, for which training data is scarce. This paper therefore proposes a language threshold for translating the affected words, or the whole sentence, into a single language and relabeling the sentence's language accordingly. The results show an 8% increase in accuracy when affected words in a sentence are translated at the 60% language threshold using SEC PCA-200.
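
As a rough illustration of the thresholding step described above, the Python sketch below labels each token's language and, once the dominant language covers at least 60% of the tokens, translates the remaining ("affected") words into that language and relabels the sentence. Everything here (detect_word_language, translate, the toy lexicons) is a hypothetical stand-in under stated assumptions; the paper's actual detector, translator, and the SEC PCA-200 embedding setting are not reproduced.

```python
# Minimal sketch of the language-threshold idea, assuming a 60% threshold.
# detect_word_language(), translate(), and the toy lexicons below are
# hypothetical stand-ins, NOT the paper's actual components.

THRESHOLD = 0.60  # the 60% language threshold reported in the abstract

# Toy Malay lexicon standing in for a real word-level language detector.
MALAY_WORDS = {"saya", "sangat", "suka", "makan", "ini"}
MS_TO_EN = {"saya": "i", "sangat": "very", "suka": "like",
            "makan": "eat", "ini": "this"}
EN_TO_MS = {v: k for k, v in MS_TO_EN.items()}


def detect_word_language(word: str) -> str:
    """Label a token as Malay ('ms') or English ('en') via the toy lexicon."""
    return "ms" if word.lower() in MALAY_WORDS else "en"


def translate(word: str, target_lang: str) -> str:
    """Toy word-level translator; a real system would call an MT model."""
    table = EN_TO_MS if target_lang == "ms" else MS_TO_EN
    return table.get(word.lower(), word)


def normalize_code_mixed(sentence: str, threshold: float = THRESHOLD):
    """If the dominant language's share of tokens meets the threshold,
    translate the minority-language ("affected") words into it and
    relabel the sentence; otherwise leave the sentence code-mixed."""
    words = sentence.split()
    langs = [detect_word_language(w) for w in words]
    dominant = max(set(langs), key=langs.count)
    if langs.count(dominant) / len(words) >= threshold:
        fixed = [w if lang == dominant else translate(w, dominant)
                 for w, lang in zip(words, langs)]
        return " ".join(fixed), dominant
    return sentence, "mixed"


print(normalize_code_mixed("saya sangat suka makan this pizza"))
# Malay dominates 4/6 tokens (~67% >= 60%), so "this" is translated:
# ('saya sangat suka makan ini pizza', 'ms')
```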
