Abstract

Modern Standard Arabic (MSA) is the formal language in most Arab countries. Arabic dialects (AD), the language of daily life, differ from MSA, especially in social media communication. In practice, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using character n-gram Markov language models and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify the 18 different Arabic dialects with an overall accuracy of 98%.

Highlights

  • Arabic is a morphologically rich and complex language, which presents significant challenges for natural language processing and its applications

  • We present a comparative study of Arabic dialect identification using social media texts, which is considered a very hard and challenging task

  • We study the impact of character n-gram Markov models and Naive Bayes classifiers using three n-gram models: unigram, bi-gram and tri-gram
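
To make the character n-gram Naive Bayes idea in the highlights concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `CharNgramNaiveBayes`, the add-one smoothing, the space padding at text boundaries and the two toy training strings with labels "A" and "B" are all assumptions made for illustration, and the toy strings are placeholders rather than real dialect data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the overlapping character n-grams of text.

    For n > 1 the text is padded with n - 1 spaces on each side
    (an assumption of this sketch) so boundary characters are counted.
    """
    padded = " " * (n - 1) + text + " " * (n - 1) if n > 1 else text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                      # n-gram order: 1, 2 or 3
        self.alpha = alpha              # additive smoothing constant (assumed)
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()     # label -> number of training texts
        self.vocab = set()              # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return log prior + summed log likelihoods for every label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        scores = {}
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                # Smoothing keeps unseen n-grams from zeroing the product.
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy usage with placeholder strings and hypothetical labels.
model = CharNgramNaiveBayes(n=2).fit(["aaaa aaab", "bbbb bbba"], ["A", "B"])
print(model.predict("aaa"))
```

Changing `n` to 1 or 3 gives the unigram and tri-gram variants named above. Working in log space avoids numerical underflow when many n-gram probabilities are multiplied.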


Summary

Introduction

Arabic is a morphologically rich and complex language, which presents significant challenges for natural language processing and its applications. It is the official language of 22 countries and is spoken by more than 350 million people around the world. The Arabic language exists in a state of diglossia, where the standard form of the language, Modern Standard Arabic (MSA), and the regional dialects (AD) live side by side and are closely related (Elfardy and Diab, 2013). Arabic has more than 22 dialects; some countries share the same dialect, while many dialects may exist alongside MSA within the same Arab country. Arabic dialects (AD), or colloquial languages, are spoken varieties of Arabic and the daily language of many people. Arabic dialects and MSA share a considerable number of semantic, syntactic, morphological and lexical features, yet they also differ in many of these features (Al-Sabbagh and Girju, 2013).
