Foreign Words and the Automatic Processing of Arabic Social Media Text Written in Roman Script

Ramy Eskander, Owen Rambow, Mohamed Al-Badrashiny, Nizar Habash

doi:10.3115/v1/w14-3901

Abstract

Arabic on social media has all the properties of any language on social media that make it tough for natural language processing, plus some specific problems. These include diglossia, the use of an alternative alphabet (Roman), and code switching with foreign languages. In this paper, we present a system which can process Arabic written in the Roman alphabet (“Arabizi”). It identifies whether each word is a foreign word or belongs to one of four other categories (Arabic, name, punctuation, sound), and transliterates Arabic words and names into the Arabic alphabet. We obtain an overall system performance of 83.8% on an unseen test set.
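The two-step idea in the abstract (tag each word with one of five categories, then send only Arabic words and names on to transliteration) can be illustrated with a minimal sketch. This is not the authors' system: the tiny lexicons, the sound pattern, and the function names tag_token, tag_sentence, and needs_transliteration are all hypothetical, chosen only to show the shape of the pipeline. A real tagger would be trained on annotated data.

```python
import re
import string

# Hypothetical toy resources (assumptions for illustration only).
FOREIGN_LEXICON = {"ok", "please", "sorry", "party", "weekend"}
NAME_LEXICON = {"ahmed", "mona"}
# Laughter-like and similar non-linguistic sounds, e.g. "haha", "hehehe", "lool", "mmm".
SOUND_PATTERN = re.compile(r"^(?:(?:ha){2,}h?|(?:he){2,}h?|lo+l|mm+)$")


def tag_token(token: str) -> str:
    """Assign one of the five categories named in the abstract to a token."""
    lowered = token.lower()
    if token and all(ch in string.punctuation for ch in token):
        return "punct"
    if SOUND_PATTERN.match(lowered):
        return "sound"
    if lowered in NAME_LEXICON:
        return "name"
    if lowered in FOREIGN_LEXICON:
        return "foreign"
    # Fallback assumption: anything unrecognized is treated as Arabizi.
    return "arabic"


def needs_transliteration(tag: str) -> bool:
    """Only Arabic words and names are transliterated into Arabic script."""
    return tag in {"arabic", "name"}


def tag_sentence(text: str) -> list[tuple[str, str]]:
    """Split into word tokens (digits allowed, as in 'ray7') and punctuation runs."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return [(tok, tag_token(tok)) for tok in tokens]


if __name__ == "__main__":
    for tok, tag in tag_sentence("ana ray7 el party ya Ahmed haha !"):
        print(tok, tag, needs_transliteration(tag))
```

Keeping digits inside word tokens matters here because Arabizi commonly uses numerals such as 7 and 3 to stand for Arabic letters that have no Roman equivalent.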

Highlights

Written language used in social media shows differences from that in other written genres: the vocabulary is informal (and sometimes the syntax is as well); there are intentional deviations from standard orthography (such as repeated letters for emphasis); there are typos; writers use non-standard abbreviations; non-linguistic sounds are written (haha); punctuation is used creatively; non-linguistic signs such as emoticons often compensate for the absence of a broader communication channel in written communication (which excludes, for example, prosody or visual feedback); and, most importantly for this paper, there frequently is code switching.
The Arabic language is a collection of varieties: Modern Standard Arabic (MSA), which is used in formal settings, and different forms of Dialectal Arabic (DA), which are commonly used informally.

Summary

Introduction

Written language used in social media shows differences from that in other written genres: the vocabulary is informal (and sometimes the syntax is as well); there are intentional deviations from standard orthography (such as repeated letters for emphasis); there are typos; writers use non-standard abbreviations; non-linguistic sounds are written (haha); punctuation is used creatively; non-linguistic signs such as emoticons often compensate for the absence of a broader communication channel in written communication (which excludes, for example, prosody or visual feedback); and, most importantly for this paper, there frequently is code switching. These facts pose a well-known problem for natural language processing of social media texts, which has become an area of interest as applications such as sentiment analysis, information extraction, and machine translation turn to this genre. Code switching is common in many linguistic communities, for example among South Asians.

Publication Date: Jan 1, 2014
Citations: 48
License type: cc-by
