Abstract

Modern Standard Arabic (MSA) is the formal language in most Arabic-speaking countries. Arabic Dialects (AD), the languages of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for translating social media texts. More precisely, it focuses on the Tunisian Dialect of Arabic (TDA), with an application to the automatic machine translation of social media text into MSA and any other target language. Linguistic tools such as a bilingual TDA-MSA lexicon and a set of grammatical mapping rules are collaboratively constructed and exploited, together with a language model, to produce MSA sentences from Tunisian dialectal sentences. This work is a first step towards collaboratively constructed semantic and lexical resources for Arabic social media within the ASMAT (Arabic Social Media Analysis Tools) project.

Highlights

  • The explosive growth of social media has led to a wide range of new challenges for machine translation and language processing

  • Building any Natural Language Processing (NLP) tool for texts extracted from social media is a very challenging and daunting task, and such tools will always be limited by the rapid changes in social media

  • This paper presents our effort to create linguistic resources, namely a bilingual lexicon, a set of grammatical mapping rules, and a rule-based translation and disambiguation system, for translating social media text from the Tunisian Dialect of Arabic (TDA) into Modern Standard Arabic (MSA)

Summary

Introduction

The explosive growth of social media has led to a wide range of new challenges for machine translation and language processing. This paper deals with the Arabic language and its variants in the analysis of social media, and with the collaborative construction of linguistic tools, such as lexical dictionaries and grammars, and their exploitation in NLP applications such as translation technologies. Arabic is a morphologically rich and complex language, which presents significant challenges for NLP and its applications. It is the official language of 22 countries and is spoken by more than 350 million people around the world. A bilingual TDA-MSA lexicon and a set of TDA mapping rules for the social media context are collaboratively constructed. These tools are exploited, together with a language model extracted from an MSA corpus, to produce MSA sentences from the Tunisian dialectal sentences found in social media.
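
The pipeline described above (mapping rules, lexicon lookup, then language-model disambiguation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the transliterated lexicon entries, the single negation rule, the bigram scores, and the function names `apply_rules`, `score` and `translate` are all hypothetical, and the paper's actual resources and algorithm may differ considerably.

```python
import re
from itertools import product

# Hypothetical toy lexicon (transliterated): dialect token -> MSA candidates.
LEXICON = {
    "nheb": ["uhibbu", "uridu"],  # ambiguous entry: two possible MSA renderings
    "barsha": ["kathiran"],
    "qahwa": ["qahwa"],
}

# Hypothetical bigram log-probabilities standing in for a language model
# that would, in practice, be estimated from an MSA corpus.
BIGRAM_LOGPROB = {
    ("<s>", "uridu"): -1.0,
    ("<s>", "uhibbu"): -1.2,
    ("uridu", "qahwa"): -0.5,
    ("uhibbu", "qahwa"): -2.0,
    ("<s>", "la"): -1.0,
    ("la", "uridu"): -0.7,
    ("la", "uhibbu"): -0.9,
}
UNSEEN = -5.0  # flat penalty for bigrams missing from the toy model

# Illustrative mapping rule: split a "ma...sh" negated form into a
# negation particle followed by the bare stem.
NEGATION = re.compile(r"^ma(.+)sh$")


def apply_rules(tokens):
    """Rewrite dialect tokens with the (single, illustrative) mapping rule."""
    out = []
    for tok in tokens:
        m = NEGATION.match(tok)
        if m and m.group(1) in LEXICON:
            out.extend(["la", m.group(1)])
        else:
            out.append(tok)
    return out


def score(sentence):
    """Sum bigram log-probabilities over a candidate MSA sentence."""
    prev, total = "<s>", 0.0
    for word in sentence:
        total += BIGRAM_LOGPROB.get((prev, word), UNSEEN)
        prev = word
    return total


def translate(tokens):
    """Apply rules, expand lexicon candidates, keep the best-scoring sequence."""
    rewritten = apply_rules(tokens)
    options = [LEXICON.get(t, [t]) for t in rewritten]  # unknown tokens pass through
    return list(max(product(*options), key=score))


print(translate(["nheb", "qahwa"]))
print(translate(["manhebsh", "qahwa"]))
```

The sketch enumerates every combination of candidates, which is only workable for short sentences and small candidate sets; a larger system would more likely use a dynamic-programming search over the candidate lattice.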

Related Work
The Tunisian Dialect of Arabic and its Challenges in Social Media
Collaboratively Constructed Linguistic Tools for TDA
Evaluations
Findings
Conclusion and Future Work