Arabic Diacritization Using Bidirectional Long Short-Term Memory Neural Networks With Conditional Random Fields

Abdulmohsen Al-Thubaity, Abdulrahman Almuhareb, Atheer Alkhalifa, Waleed Alsanie

doi:10.1109/access.2020.3018885


Open Access

https://doi.org/10.1109/access.2020.3018885


Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 53
License type: CC BY 4.0

Affiliation: King Abdulaziz City for Science and Technology

Abstract

Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine translation, and part-of-speech tagging. This article discusses the use of bidirectional long short-term memory neural networks with conditional random fields for Arabic diacritization. This approach requires no morphological analyzers, dictionary, or feature engineering, but rather uses a sequence-to-sequence schema. The input is a sequence of characters that constitute the sentence, and the output consists of the corresponding diacritic(s) for each character in that sentence. The performance of the proposed approach was examined using four datasets with different sizes and genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). For training, 60% of the sentences were randomly selected from each dataset, 20% were selected for validation, and 20% were selected for testing. The trained models achieved diacritic error rates of 3.41%, 1.34%, 1.57%, and 2.13% and word error rates of 14.46%, 4.92%, 5.65%, and 8.43% on the KACST TTS, Holy Quran, Sahih Al-Bukhary, and ATB datasets, respectively. Comparison of the proposed method with those used in other studies and existing systems revealed that its results are comparable to or better than those of the state-of-the-art methods.
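The evaluation setup in the abstract (a random 60/20/20 split and two error rates) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes that the diacritic error rate (DER) is the share of characters whose predicted diacritic label differs from the gold label, and that the word error rate (WER) is the share of words containing at least one such character. The abstract does not spell out these definitions or any exclusions (for example, of undiacritized characters), and the function names and the seed are hypothetical.

```python
import random


def split_60_20_20(sentences, seed=0):
    """Randomly split sentences into train/validation/test (60/20/20).

    The seed is an assumption added here for reproducibility.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def error_rates(gold_words, pred_words):
    """Return (DER, WER) for parallel lists of words.

    Each word is a list of diacritic labels, one label per character
    ("" meaning no diacritic).
    """
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted word counts differ")
    char_total = char_errors = word_errors = 0
    for gold, pred in zip(gold_words, pred_words):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted word lengths differ")
        wrong = sum(g != p for g, p in zip(gold, pred))
        char_total += len(gold)
        char_errors += wrong
        word_errors += 1 if wrong else 0
    return char_errors / char_total, word_errors / len(gold_words)
```

Because one wrong character is enough to mark a whole word as wrong, WER under these definitions is always at least as high as DER, which matches the pattern of the reported figures.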

Highlights

Diacritics are marks written above or below words or letters in several languages, such as Arabic [1], Turkish [2], and Romanian [3].
We investigated the performance of a supervised deep learning approach, bidirectional long short-term memory (BiLSTM) with conditional random fields (CRFs) [14], for Arabic diacritization.
The proposed method of Arabic diacritic restoration uses no morphological analyzer, dictionary, rules, or feature engineering. It is based solely on data and differs from other Arabic diacritic restoration efforts that employ long short-term memory (LSTM) networks in that it takes the sequence of characters that constitute the sentence as input and produces their corresponding diacritics as output.
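To illustrate what a CRF layer adds on top of per-character BiLSTM scores, here is a small sketch of Viterbi decoding, the standard way to find the best label sequence in a linear-chain CRF. This page does not describe the authors' decoding procedure, so treat this as an assumption about how such a model is typically used. The emission and transition scores are invented numbers, and the function name is hypothetical.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence as a list of label indices.

    emissions[t][j]   : score of label j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from label i to label j (CRF parameters)
    """
    n_labels = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at this position.
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Invented scores for three characters and two labels.
emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
no_transitions = [[0.0, 0.0], [0.0, 0.0]]
penalized = [[0.0, -5.0], [0.0, 0.0]]  # strongly discourage label 0 -> label 1
independent_path = viterbi(emissions, no_transitions)
joint_path = viterbi(emissions, penalized)
```

With all transition scores at zero, decoding reduces to picking the best label at each position independently. Once a transition is penalized, the best sequence as a whole can differ from those per-position choices, which is the motivation for scoring diacritic sequences jointly.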

Summary

INTRODUCTION

Diacritics are marks written above or below words or letters in several languages, such as Arabic [1], Turkish [2], and Romanian [3]. Existing systems for Arabic diacritic restoration typically treat the problem either as part of morphological disambiguation or as a standalone problem. In the latter case, most proposed systems are based on dictionaries and rules, language resources, or feature engineering approaches that employ linguistic information. The proposed method of Arabic diacritic restoration uses no morphological analyzer, dictionary, rules, or feature engineering. It is based solely on data and differs from other Arabic diacritic restoration efforts that employ long short-term memory (LSTM) networks in that it takes the sequence of characters that constitute the sentence as input and produces their corresponding diacritics as output.
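The character-in, diacritic-out framing can be made concrete with a short sketch that turns a diacritized string into a pair of parallel sequences: the bare characters (model input) and the diacritic(s) attached to each character (target labels). This is an illustration under stated assumptions, not the authors' preprocessing. It treats the Unicode code points U+064B to U+0652 (the tanween marks, fatha, damma, kasra, shadda, and sukun) as the diacritic set, and the function name is hypothetical.

```python
# Assumed diacritic inventory: Unicode U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split diacritized text into (characters, labels).

    labels[i] holds the diacritic(s) that follow characters[i], or ""
    if the character carries none. A diacritic with no preceding base
    character is ignored.
    """
    chars, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if chars:
                labels[-1] += ch  # e.g. shadda followed by fatha
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


# kaf+fatha, ta+fatha, ba+fatha: the fully diacritized word "kataba".
chars, labels = split_diacritics("\u0643\u064E\u062A\u064E\u0628\u064E")
```

Spaces and punctuation pass through as ordinary characters with an empty label, so a whole sentence maps to one input sequence and one equally long label sequence.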

RELATED WORK

BACKGROUND

RESULTS

CONCLUSIONS


Similar Papers

Location Extraction from Twitter Messages Using a Bidirectional Long Short-Term Memory Neural Network with Conditional Random Field Model
Zi Chen ... Samsung Lim
01 Jan 2020

Automatic Methods and Neural Networks in Arabic Texts Diacritization: A Comprehensive Survey
Manar M Almanea
IEEE Access | VOL. 9
01 Jan 2020

Deep Learning for Natural Language Processing
Jiajun Zhang ... Chengqing Zong
01 Jan 2019

A Persian part of speech tagging system using the long short-term memory neural network
Abbas Koochari ... Vahid Hajihashemi
23 Dec 2020
