Open Vocabulary Arabic Diacritics Restoration

Yasser Hifny

doi:10.1109/lsp.2019.2933721

Abstract

Diacritics restoration is a necessary component of Arabic text-to-speech systems. When diacritics are present, phonetic transcription can be implemented with a few rules. The dominant approach restores Arabic diacritics by language model scoring, and the language model is usually built over a fixed vocabulary. Because Arabic is a morphologically rich language, the number of out-of-vocabulary (OOV) words is large, and the diacritization algorithm fails to restore diacritics for these words. In this letter, we present a novel approach that supports open-vocabulary diacritics restoration based on the Byte Pair Encoding (BPE) method. BPE segments words into variable-length sub-word units, which allows an open vocabulary to be built from a fixed dictionary of sub-word units. On the Tashkeela diacritization task, this open-vocabulary approach outperforms the word-based and character-based methods commonly used in the literature.
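
To illustrate how BPE can cover unseen words with a fixed inventory of sub-word units, the following is a minimal sketch of the generic BPE procedure (learn merges from word counts, then apply them to a new word). It is not the letter's implementation: the function names, the `</w>` end-of-word marker, the toy Latin-script corpus, and the number of merges are all assumptions made for illustration, and no diacritics or language model scoring are involved.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} mapping."""
    # Start from characters, plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment any word, seen or unseen, by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (assumed data); "lowest" never appears in it.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
units = segment("lowest", merges)
print(units)
```

In this sketch the unseen word is still expressed entirely in units drawn from the fixed inventory (base characters plus learned merges), which is the property that lets a sub-word model avoid OOV failures as long as every character of the word was seen in training.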

Journal: IEEE Signal Processing Letters
Publication Date: Oct 1, 2019
Citations: 37

Similar Papers

BPE-Dropout: Simple and Effective Subword Regularization
Ivan Provilkov ... Elena Voita
01 Jan 2020

Byte Pair Encoding is Suboptimal for Language Model Pretraining
Kaj Bostrom ... Greg Durrett
01 Jan 2020

A Study of BPE-based Language Modeling for Open Vocabulary Latin Language OCR
Wenping Hu ... Qiang Huo
01 Sep 2020

Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling
Kavya Manohar ... Rajeev Rajan
EURASIP Journal on Audio, Speech, and Music Processing | VOL. 2023
04 Nov 2023
