Abstract

Deep learning has emerged as a new area of machine learning research. It is an approach that learns features and hierarchical representations purely from data, and it has been successfully applied to images, sound, text, and motion. The techniques developed in deep learning research are already influencing research on Natural Language Processing (NLP). Arabic diacritics are vital components of Arabic text that remove ambiguity from words and reinforce the meaning of the text. In this paper, a Deep Belief Network (DBN) is used as a diacritizer for Arabic text. The DBN is a deep learning algorithm that has recently proved very effective for a variety of machine learning problems. We evaluate the use of DBNs as classifiers in automatic Arabic text diacritization. The DBN was trained to classify each input letter individually, assigning it the corresponding diacritized form. Experiments were conducted using two benchmark datasets, the LDC ATB3 and Tashkeela. Our best settings achieve a diacritic error rate (DER) and word error rate (WER) of 2.21% and 6.73%, respectively, on the ATB3 benchmark, an improvement of 26% over the best published results. On the Tashkeela benchmark, our system continues to achieve high accuracy, with a DER of 1.79% and a 14% improvement.
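The abstract reports DER and WER without defining them. The sketch below assumes the definitions commonly used in the diacritization literature: DER as the percentage of letters whose diacritics are restored incorrectly, and WER as the percentage of words containing at least one such letter. The function names `split_letters` and `der_wer` are hypothetical, and the paper's own scoring rules (for example, whether word-final case endings or undiacritized letters are counted) may differ.

```python
import unicodedata


def split_letters(word):
    """Split a diacritized word into (base_letter, diacritics) pairs.

    Assumption: diacritics are Unicode combining marks that follow
    the letter they modify, as the Arabic harakat do.
    """
    pairs = []
    for ch in word:
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, hypothesis):
    """Return (DER, WER) in percent for two diacritizations of one text."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("texts must contain the same number of words")

    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_letters(ref_word)
        hyp_pairs = split_letters(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ; only diacritics may vary")
        # Sort the marks so that shadda+vowel order does not matter.
        errors = sum(
            sorted(ref_marks) != sorted(hyp_marks)
            for (_, ref_marks), (_, hyp_marks) in zip(ref_pairs, hyp_pairs)
        )
        letters += len(ref_pairs)
        letter_errors += errors
        word_errors += int(errors > 0)

    return 100 * letter_errors / letters, 100 * word_errors / len(ref_words)


# "kataba kutiba" as the reference, "kataba kataba" as the system output.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"
der, wer = der_wer(kataba + " " + kutiba, kataba + " " + kataba)
# 2 of 6 letters and 1 of 2 words are wrong.
print(f"DER = {der:.2f}%, WER = {wer:.2f}%")
```

Scoring per letter matches the per-letter classification setup described in the abstract, since each letter receives exactly one predicted diacritic label.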

Highlights

  • Arabic, one of the six official languages of the United Nations (UN), is a Semitic language used by Arabs and Muslims all over the world

  • Selecting the appropriate batch size can improve the performance of the Deep Belief Network (DBN) model and shorten the run-time

  • After a number of experiments, we found that the best DBN architecture has three hidden layers of Restricted Boltzmann Machines (RBMs)


Summary

Introduction

Arabic, one of the six official languages of the United Nations (UN), is a Semitic language used by Arabs and Muslims all over the world. The Arabic-speaking population of the world is around 400 million native speakers and 1 billion Muslims, with 30 different dialects [1]. The Arabic alphabet consists of 28 letters in addition to the Hamza. Arabic is written from right to left, has no capitalization, and its letters change shape according to their position in the word [2]. In the Arabic language, diacritic marks are used to clarify how to pronounce a letter.
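To make the role of diacritics concrete, the short sketch below shows two words that share the same three letters and differ only in their diacritic marks: "kataba" (he wrote) and "kutiba" (it was written). It relies only on Python's standard `unicodedata` module; the helper name `strip_diacritics` is hypothetical, and treating every Unicode combining mark as a diacritic is a simplifying assumption.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks, leaving only the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, teh, beh, each with fatha
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # damma, kasra, fatha

# Without diacritics the two words are written identically.
print(strip_diacritics(kataba) == strip_diacritics(kutiba))

# The marks themselves are ordinary code points with their own names.
for ch in kutiba:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Undiacritized text of this kind is what a diacritizer receives as input; its task is to recover the marks that were stripped.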

