Abstract

This corpus-based study examines how well recent morphological analyzers handle the ambiguities that appear in Egyptian Arabic electronic texts written on social media. It evaluates the automatic annotation of the Egyptian Arabic Penn-Treebank (ARZ ATB) produced by CALIMA, the Columbian Arabic diaLectal Morphological Analyzer. The corpus was collected by the Linguistic Data Consortium as part of the BOLT project, which aims to develop technology that enables English speakers to retrieve and understand information from informal foreign-language sources, including chat, text messaging, and spoken conversations. To reach better results, the research concentrated on the noun category. A gold standard was built from the 1,723 most frequent nominal word types among the 6,543 word types found in 16,226 words selected randomly from the ARZ ATB corpus. The total number of collected morphemes was 2,798. The recall, precision, F-score, and accuracy of the tool were calculated: recall was 89%, precision was 94.5%, F-score was 93.7%, and accuracy reached 93%. The errors were classified to reveal the main morphological ambiguities that the tool could not handle because of the way the written form of the Egyptian dialect has developed on social media. According to the results, the orthographic variation found in the Egyptian Arabic dialect reflects the lack of a standard writing system governing the use of the dialect in its written form. Gathering and describing the main orthographic variations is therefore imperative for handling the ambiguities revealed in the study.
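As a rough illustration of how the four evaluation figures relate to one another, the following minimal sketch computes precision, recall, F-score, and accuracy from confusion counts. The abstract does not give the underlying counts or say which F-score variant was used, so the counts below are hypothetical, the names `Counts`, `precision`, `recall`, `f_beta`, and `accuracy` are illustrative, and the balanced F1 (beta = 1) is assumed as the default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for analyzer output compared with a gold standard."""
    tp: int      # analyses the tool proposed that match the gold standard
    fp: int      # analyses the tool proposed that the gold standard rejects
    fn: int      # gold-standard analyses the tool failed to propose
    tn: int = 0  # correctly rejected analyses (only needed for accuracy)


def precision(c: Counts) -> float:
    """Share of proposed analyses that are correct."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    """Share of gold-standard analyses that the tool recovered."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1 gives F1)."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def accuracy(c: Counts) -> float:
    """Share of all decisions that are correct."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    c = Counts(tp=90, fp=10, fn=5, tn=20)
    p, r = precision(c), recall(c)
    print(f"precision={p:.3f} recall={r:.3f} "
          f"F1={f_beta(p, r):.3f} accuracy={accuracy(c):.3f}")

    # The same function applied to the rounded figures reported above;
    # the result depends on the assumed beta and on rounding.
    print(f"F1 from reported P and R: {f_beta(0.945, 0.89):.3f}")
```

Working from counts rather than from rounded percentages keeps the four measures mutually consistent, which makes it easier to compare them with published figures.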
