Location Analysis for Arabic COVID-19 Twitter Data Using Enhanced Dialect Identification Models

Nader Essam, Mohsen Rashwan, Md Maruf Hasan, Majed A Alshamari, Abdullah M Moussa, Amna Asif, Sherif Abdou, Shaheen Khatoon, Khaled M Elsayed

doi:10.3390/app112311328

Abstract

The recent surge of social media networks has provided a channel for gathering and publishing vital medical and health information. The role of these networks becomes more prominent in periods of crisis, such as the recent COVID-19 pandemic. They have been the leading platform for broadcasting health news updates, precautionary instructions, and governmental procedures, and they also provide an effective means of gathering public opinion and tracking breaking events and stories. Location-based analysis of social media input requires the users’ location information, which is most often either missing or hidden. For some languages, such as Arabic, a user’s location can be predicted from their dialect, since most Arab countries have their own local dialects. Natural Language Processing (NLP) techniques offer several approaches to dialect identification. Recent language models that use contextual word representations in a continuous space, such as BERT, have significantly improved many NLP applications. In this work, we use BERT-based models to improve dialect identification for Arabic text. We report results for models that recognize the source Arab country, or Arab region, of Twitter data. Our results show a 3.4% absolute improvement in dialect identification accuracy at the regional level over the state-of-the-art result. When we exclude the Modern Standard Arabic (MSA) set, which is the formal variety of Arabic, we achieve a 3% absolute gain in accuracy over the state of the art in distinguishing the three major Arabic dialects. Finally, we applied the developed models to a recently collected resource of Arabic COVID-19 tweets to recognize the source country from the users’ tweets. We achieved a weighted average accuracy of 97.36%, which suggests the models could serve as a tool for policymakers to support country-level disaster-related activities.
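
The abstract reports a weighted average accuracy but does not define the weighting on this page. The sketch below shows one common reading, in which each country's accuracy is weighted by its share of the evaluated tweets. The function name, the input layout, and the counts are hypothetical and are not taken from the paper.

```python
from typing import Mapping, Tuple


def weighted_average_accuracy(per_class: Mapping[str, Tuple[int, int]]) -> float:
    """Average per-class accuracy, weighted by each class's sample share.

    per_class maps a label (here, a country) to (correct, total) counts.
    Assumption: weights are proportional to the number of samples per class.
    """
    total = sum(n for _, n in per_class.values())
    if total == 0:
        raise ValueError("no samples to evaluate")
    # Class accuracy (c / n) multiplied by the class weight (n / total).
    return sum((c / n) * (n / total) for c, n in per_class.values() if n > 0)


# Made-up counts for illustration only; these are not results from the paper.
example_counts = {"country_a": (90, 100), "country_b": (40, 50)}
example_score = weighted_average_accuracy(example_counts)
```

Under this weighting the result equals total correct predictions divided by total samples, so countries with more tweets influence the score more than countries with few.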

Highlights

On 30 January 2020, the World Health Organization declared the COVID-19 outbreak a Public Health Emergency of International Concern, and on 11 March 2020 it characterized the outbreak as a pandemic, following the massive spread of the SARS-CoV-2 virus in many countries around the world [1].
The Arabic Online Commentary (AOC) dataset consists of 3 million Modern Standard Arabic (MSA) and dialectal comments, 108k of which were labeled through crowdsourcing.
We obtain an absolute gain of over 3% on the 3-way classification task and over 3.4% on the 4-way classification task.

Summary

Introduction

On 30 January 2020, the World Health Organization declared the COVID-19 outbreak a Public Health Emergency of International Concern, and on 11 March 2020 it characterized the outbreak as a pandemic, following the massive spread of the SARS-CoV-2 virus in many countries around the world [1]. In [7], the author states that the Ministry of Health of Saudi Arabia has used several Twitter accounts to populate many health-related hashtags with governmental advice for Saudi Arabian citizens. This is evidence that decision-makers consider social media an important channel for communicating with people. Officials usually use MSA, the formal variety of the language used in educational organizations and pan-Arab news broadcasting, which differs from the varieties spoken in daily communication by native speakers [9,10,11]. These daily varieties constitute the dialects of Arabic, which can be classified based on common linguistic features of geographical locations.

Related Work

The Feature Extractor

The Classifier System

BERT Model

Experiments and Discussion

Embeddings Types

AOC-Based Results

Findings

Discussion

Conclusions

Journal: Applied Sciences
Publication Date: Nov 30, 2021
Citations: 8
License type: CC BY 4.0
