Abstract

Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash, Ramy Eskander, Owen Rambow. Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). 2014.

Highlights

  • The language used in social media differs from other written genres in many ways: its vocabulary is informal, with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content, such as laughter, sound representations, and emoticons, is written out

  • We describe the process of creating a novel resource, a parallel text corpus of Arabizi and Arabic script, at the Linguistic Data Consortium (LDC)

  • The rest of this paper describes the collection of Egyptian Short Messaging System (SMS) and Chat data and the creation of a parallel text corpus of Arabizi and Arabic script for the Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program

Summary

Introduction

The language used in social media differs from other written genres in many ways: its vocabulary is informal, with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content, such as laughter, sound representations, and emoticons, is written out. This situation is exacerbated in the case of Arabic social media for two reasons. Social media communication in Arabic takes place using a variety of orthographies and writing systems, including Arabic script, Arabizi, and a mixture of the two. Although not all social media communication uses Arabizi, its use is prevalent enough to pose a challenge for Arabic NLP research.

