Abstract
PURPOSE/AIM & BACKGROUND
Although Arabic is spoken in twenty-two countries by more than 250 million speakers, Natural Language Processing (NLP) practitioners still consider it a low-resource language. Formal Arabic texts are typically written in Modern Standard (or Written) Arabic (MSA), the form used in formal writing and taught in schools to Arabic speakers. Informal communication among Arabic speakers, however, takes place in local dialects, which makes Arabic a diglossic language: a formal written variety coexists with colloquial varieties that differ among speakers of the same language. Arabic has multiple dialects across the regions of the Arab world: Gulf, Levantine and North African. On social media, users commonly write in their local dialect rather than in formal MSA. This introduces a core NLP problem for Arabic: dialect identification. The specific dialect must be identified before tasks such as tokenizing and parsing, and before downstream tasks such as semantic inference. Processing massive amounts of text written in these local dialects requires this identification step to improve accuracy, especially for automatic text comprehension tasks. Although Arabic dialects share a majority of their words, it is not uncommon for the same word to have different meanings across dialects. In addition to improving the accuracy of NLP tasks, Arabic Dialect Identification (ADI) enables finer-grained demographic identification when mining texts such as consumer reports, health forums, and entertainment and tourism reviews, which can ultimately lead to improved services for each demographic.
The problem of ADI has been addressed by several studies, such as Al-Walaie & Khan (2017) and Harrat et al. (2019). Some works focus mainly on curating datasets for the problem, such as the Sham dataset proposed by Abu Kwaik et al. (2018).
In this work we focus on both tasks: we curate an Arabic dialect dataset for two variants of Arabic (Saudi Arabian and Egyptian), and we train supervised machine learning models to address the identification task.