Stochastic Models for Automatic Diacritics Generation of Arabic Names

Fawaz S Al-Anzi

doi:10.1007/s10579-004-2323-6

Abstract

This paper proposes two new models for generating diacritics for Arabic names. The first, called the N-gram model, is a stochastic model built on a corpus database of N-grams extracted from a large database of names, each stored with its probability according to the N-gram's position in a text (Bhal et al., 1983), i.e., the probability that the N-gram occurs at position x, where x can be the first, second, ... or ith position in the text. Replacing the N-grams with their patterns extends the first model into the second proposed stochastic model, called the Envelope model. These two models are unique in being the first attempt to solve the problem of Arabic text diacritics generation using stochastic approaches with linguistic constraints that are neither grammatical nor purely lexical (Merialdo, 1991; Ney and Essen, 1991; Schukat-Talamazzini et al., 1992; Witschel and Niedermair, 1992). This methodology helps reduce the size and complexity of software implementations of the proposed models and makes them easier to update and port across different platforms.
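The idea of position-dependent N-gram probabilities can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `positional_ngram_probs`, the toy corpus, and the use of Latin transliteration in place of diacritized Arabic text are all assumptions made for illustration. The sketch estimates, for each character position, the relative frequency of every N-gram that starts at that position.

```python
from collections import Counter, defaultdict


def positional_ngram_probs(names, n=2):
    """Estimate P(ngram | position) by relative frequency.

    Hypothetical helper: `names` is a list of strings and `n` is the
    N-gram length. Returns {position: {ngram: probability}}.
    """
    counts = defaultdict(Counter)  # position -> counts of N-grams starting there
    for name in names:
        for pos in range(len(name) - n + 1):
            counts[pos][name[pos:pos + n]] += 1

    probs = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        probs[pos] = {gram: c / total for gram, c in counter.items()}
    return probs


# Toy, made-up corpus (Latin transliteration stands in for diacritized names).
names = ["salim", "salma", "samir", "karim"]
probs = positional_ngram_probs(names, n=2)

# Three of the four names start with "sa", so P("sa" | position 0) is 0.75.
print(probs[0]["sa"])
```

In a diacritization setting, a table like this would presumably be built from diacritized names and consulted to choose the most probable diacritized N-gram at each position; that lookup step is not shown here because the abstract does not specify it.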

Full Text