Abstract

Diacritics restoration is a necessary component of Arabic text-to-speech systems. When diacritics are present, phonetic transcription can be implemented with a few rules. The dominant approach restores Arabic diacritics by language model scoring, and the language model is usually built over a fixed vocabulary. Because Arabic is a morphologically rich language, the number of out-of-vocabulary (OOV) words is large, and the diacritization algorithm fails to restore diacritics for these words. In this letter, we present a novel approach that supports open-vocabulary diacritics restoration based on the Byte Pair Encoding (BPE) method. BPE segments words into variable-length sub-word units, so an open vocabulary can be covered with a fixed dictionary of sub-word units. On the Tashkeela diacritization task, this open-vocabulary approach outperforms the word-based and character-based methods commonly used in the literature.
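To make the BPE segmentation step concrete, the following is a minimal sketch of the generic BPE procedure (learn merges from word counts, then apply them to a new word). It is not the implementation used in this letter. The function names `learn_bpe`, `merge_pair`, and `segment`, the `</w>` end-of-word marker, and the toy Latin-script word counts are all assumptions made for illustration. The same code operates on any character sequence, including Arabic text.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the "</w>" marker is an illustrative convention, not taken from the letter).
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; "lowest" never appears in it.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=10)
print(segment("lowest", toy_merges))
```

Because the sub-word dictionary always contains the individual characters, a word that was never seen when the merges were learned is still split into known units rather than mapped to an unknown token. This is the property that allows an open vocabulary to be covered with a fixed sub-word dictionary.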
