Abstract

The Arabic language has many dialects spoken across the twenty-two Arabic-speaking countries in Asia and Africa. Arabic Dialect Identification (ADI) remains a challenging task due to the well-recognized complexity and variation of Arabic dialects. Notably, Arabic dialects share the majority of their tokens. State-of-the-art solutions have been built upon various machine learning approaches; however, they commonly treat all words as equally likely and thus ignore the importance of dialectal words with respect to a given dialect. In this paper, we propose a three-stage neural approach to learn dialectal semantic representations from a given corpus. Specifically, we first capture the dialect-relevant information, which is then used to model the dialectal vector representation. The goal is to filter out the words shared between dialects and thereby reduce the noisy information fed to the fully connected layer. We introduce two variants, one LSTM-based and one Transformer-based. Finally, we empirically evaluate the proposed solution through a comparative study on real benchmark datasets, including MADAR, NADI, and QADI. Our extensive experiments show that it consistently achieves state-of-the-art performance. Given the well-recognized difficulty of ADI, the improvement margins can be deemed considerable. The code is available at https://github.com/amurtadha/arabic-dialect-identification.
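To make the described architecture concrete, the following is a minimal sketch (not the authors' implementation) of the LSTM-based variant: a learned gate downweights tokens shared across dialects before a BiLSTM encoder and a fully connected classifier. All module names, dimensions, and the number of dialect labels are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's released code): a dialect-relevance
# gate suppresses shared (non-dialectal) tokens before encoding and classification.
import torch
import torch.nn as nn


class DialectAwareLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_dialects=26):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Scalar gate per token: scores how dialect-relevant a word is, so that
        # words shared across dialects contribute less to the representation.
        self.relevance_gate = nn.Linear(embed_dim, 1)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_dialects)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)                 # (B, T, E)
        gate = torch.sigmoid(self.relevance_gate(emb))  # (B, T, 1)
        filtered = emb * gate                           # downweight shared words
        encoded, _ = self.encoder(filtered)             # (B, T, 2H)
        pooled = encoded.mean(dim=1)                    # sentence representation
        return self.classifier(pooled)                  # dialect logits


# Example usage: score a batch of two token-id sequences of length 12.
model = DialectAwareLSTM(vocab_size=50_000)
logits = model(torch.randint(1, 50_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 26])
```

The Transformer-based variant would replace the BiLSTM encoder with a Transformer encoder while keeping the same gating and classification stages.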
