Abstract

As more data are introduced, they bring missing values, inconsistencies, and heterogeneities, collectively known as unclean aspects. Text analytics relies on clean data to produce reliable results, so pre-processing, specifically language detection and normalization, is an essential phase. The difficulty in conducting text analytics on Malay social media text is that it has diverged substantially from formal Malay in spelling and construction, which makes it hard to process. Recent work has normalized only selected types of Malay social media text, describing them with simple and narrow categorizations. A formal categorization is necessary to describe the different patterns of Malay social media text in a meaningful way and to allow suitable methods to be selected for handling them. In this paper, we propose a non-exhaustive formal categorization of Malay social media text based on its inherent nature. We refer to this text as Social Media Malay Language (SMML) to differentiate it from the standard Malay language. The categories are spelling variations, Malay-English mixed sentences, loan words/phrases, slang-based words, and vowel-less words. We also normalized two of the SMML categories, spelling variations and vowel-less words, using two similarity-matching techniques (i.e., nGram Tversky Index and Levenshtein). Our results show that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. Normalizing the remaining categories requires extensive further research.
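
As a rough illustration of the two similarity measures named above, the following Python sketch computes a character n-gram Tversky index and a Levenshtein distance, then picks the closest lexicon entry for a noisy token. It is not the implementation used in this work: the function names, the n-gram size, the alpha and beta weights, the tie-breaking rule, and the toy lexicon are all assumptions chosen for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word` (no padding)."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def tversky_index(a, b, n=2, alpha=0.5, beta=0.5):
    """Tversky index over the character n-gram sets of `a` and `b`.

    alpha and beta weight the n-grams unique to `a` and to `b`.
    alpha = beta = 0.5 gives the Dice coefficient and
    alpha = beta = 1 gives the Jaccard index.
    The defaults here are assumptions, not values from the paper.
    """
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    return common / denom if denom else 0.0


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(token, lexicon):
    """Pick the lexicon entry closest to `token`.

    Hypothetical ranking: smallest Levenshtein distance first,
    ties broken by the higher bigram Tversky index.
    """
    return min(lexicon,
               key=lambda w: (levenshtein(token, w), -tversky_index(token, w)))


# Toy lexicon of standard Malay words (illustrative only).
lexicon = ["makan", "minum", "tidur"]
print(levenshtein("mkn", "makan"))  # 2: two vowel insertions
print(normalize("mkn", lexicon))    # makan
```

In this sketch the vowel-less token "mkn" shares no character bigrams with "makan", so its bigram Tversky index is 0 even though the edit distance is small. That is one way a single similarity measure can miss a category, and it suggests why a more sophisticated technique may be needed to improve precision.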
