Abstract

There is an increasing demand for analyzing the content of social media. However, sentiment analysis in Arabic, and especially in Arabic dialects, can be very complex and challenging. This paper presents the details of collecting and constructing a classified corpus of 4180 multi-dialectal Saudi tweets (SDCT). The tweets were annotated manually by five native speakers in two stages. The first stage annotated each tweet's dialect as Hijazi, Najdi, or Eastern, based on Saudi regions. The second stage annotated the sentiment as positive, negative, or neutral. The annotation process was evaluated using the Kappa score. Validation used cross-validation in eight baseline experiments that trained different classifier models. The results show that 10-fold cross-validation yields higher accuracy than 5-fold across the eight experiments, and that classification of the Eastern dialect achieved the best accuracy among the dialects, at 91.48%.
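The abstract does not say which Kappa statistic was used to evaluate the annotation. With five annotators, Fleiss' kappa is one common choice, so the sketch below assumes it purely for illustration. The function name `fleiss_kappa` and the toy labels are hypothetical and are not taken from the SDCT corpus.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of items, each a list of labels (one label per rater).
    categories: all labels that raters could choose from.
    Note: undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # For each item, count how many raters chose each label.
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each label.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical toy data: four tweets, five annotators each.
LABELS = ["positive", "negative", "neutral"]
toy_ratings = [
    ["positive"] * 5,
    ["negative", "negative", "negative", "negative", "neutral"],
    ["neutral", "neutral", "positive", "neutral", "neutral"],
    ["negative"] * 5,
]
kappa = fleiss_kappa(toy_ratings, LABELS)
print(f"Fleiss' kappa on toy data: {kappa:.3f}")
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. The same function could be applied separately to the dialect labels and the sentiment labels, since the two annotation stages use different label sets.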

Highlights

  • Today, there are roughly 6500 spoken languages around the world, and each language has multiple dialects [1]

  • The objective of this paper was to enrich the resources available for Arabic, the language used in Saudi Arabia, by constructing a Saudi corpus based on dialects, and to make it available for further research in Arabic studies, such as natural language processing (NLP) applications

  • This paper presented the methodology used to collect and build a corpus of 4180 multi-dialectal Saudi tweets (SDCT)


Summary

Introduction

There are roughly 6500 spoken languages around the world, and each language has multiple dialects [1]. Arabic is the official language of 22 countries, and it is spoken by over 400 million people. It is considered the fourth most used language on the Internet [2]. The Gulf region consists of six countries: Saudi Arabia, United Arab Emirates, Qatar, Kuwait, Bahrain, and Oman, each of which has its own dialect. Within Saudi Arabia, each region also has its own dialect. Arabic dialects (AD) differ so much from one another that they can be considered different languages; the Arabic language and its dialects therefore require further intensive study and analysis.

