Abstract

Malay social media text is text written in Malay on social media networks such as Twitter. It commonly contains non-standard words and is filled with dialects, foreign-language terms, abbreviations, ungrammatical constructions, spelling errors, and more. This type of text is well known to be difficult to process because of its high noise and distinct structure. Rigorous text normalization can resolve these problems and is critical before any technique is implemented and evaluated on social media text. This paper proposes an improved normalization method for Malay social media text that converts non-standard Malay words using a rule-based model. The method normalizes the kinds of words commonly used by Malaysian users, such as non-standard Malay (dialect and slang), Romanized Arabic, and English words. The resulting Malay text normalizer uses a set of rules that extend across different domains of natural language processing (NLP) and is expected to address the challenges of processing Malay social media text. To evaluate the normalizer’s performance, this study applies it in a Part-of-Speech (POS) tagging application. The implementation demonstrates a substantial improvement in POS tagging over several pre-processing stages, with accuracy improving by up to 31.8%. The increase in POS tagging accuracy indicates two main points. First, the proposed rules improve the performance of a Malay text normalizer on social media text. Second, the proposed Malay text normalizer improves POS tagging accuracy and demonstrates the importance of normalization as a pre-processing step in NLP applications.

Highlights

  • Twitter is among the most influential social networks globally after Facebook, and it is expected to remain a common choice for years to come [1]

  • This study uses a collection of Malay tweets and domain-dependent information, such as that in the standardization repository, to evaluate the proposed Malay text normalizer by applying it to POS tagging

  • The text normalizer for Malay social media text was built using a rule-based approach
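To make the idea of a rule-based normalizer concrete, the following is a minimal Python sketch. It is not the authors' rule set: the lookup table, the two regular-expression rules, and the function names `normalize_token` and `normalize` are all assumptions chosen for illustration, based on common Malay shorthand (for example, "yg" for "yang" and a trailing "2" for reduplication).

```python
import re

# Hypothetical lookup table: illustrative entries only, not the paper's rules.
ABBREVIATIONS = {
    "yg": "yang",
    "dgn": "dengan",
    "sy": "saya",
    "x": "tidak",
    "tak": "tidak",
}

# Assumed rule: collapse a letter repeated three or more times ("bisingggg").
_ELONGATION = re.compile(r"([a-z])\1{2,}")
# Assumed rule: a trailing "2" marks reduplication ("kawan2" -> "kawan-kawan").
_REDUPLICATION = re.compile(r"^([a-z]+)2$")


def normalize_token(token: str) -> str:
    """Apply the illustrative rules to a single whitespace-delimited token."""
    word = token.lower()
    word = _ELONGATION.sub(r"\1", word)
    match = _REDUPLICATION.match(word)
    if match:
        return f"{match.group(1)}-{match.group(1)}"
    # Fall back to the dictionary; unknown words pass through unchanged.
    return ABBREVIATIONS.get(word, word)


def normalize(text: str) -> str:
    """Normalize a whole message token by token."""
    return " ".join(normalize_token(t) for t in text.split())


print(normalize("sy x suka kawan2 yg bisingggg"))
```

A real normalizer would need far more rules, plus handling for dialect, Romanized Arabic, and English words as the abstract describes. The sketch only shows how dictionary lookups and pattern rules can be chained before a downstream task such as POS tagging.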


Summary

Introduction

Twitter is among the most influential social networks globally after Facebook, and it is expected to remain a common choice for years to come [1]. A tweet is a short text limited to 280 characters [2][3]. This restriction leads users to write more creatively, using non-standard forms. According to [1], from the first quarter of 2010 to the fourth quarter of 2018, there were 1,318 million monthly active Twitter users worldwide. These statistics clearly show that Twitter holds an extensive body of social media text (that is, text written in colloquial or non-standard language).
