Abstract

In this paper, we analyze the impact of five Arabic dialects on the front-end and pronunciation dictionary components of an Automatic Speech Recognition (ASR) system. We use the ASR system's phonetic decision tree as a diagnostic tool to compare the robustness of MFCC and MLP front-ends to dialectal variations in the speech data, and find that MLP Bottle-Neck features are less robust to such variations. We also perform a rule-based analysis of the pronunciation dictionary, which enables us to identify dialectal words in the vocabulary and automatically generate pronunciations for unseen words. We show that our technique produces pronunciations with an average phone error rate of 9.2%.

The Arabic language is characterized by its multitude of dialects. Although Modern Standard Arabic (MSA) is used in writing, in TV/radio broadcasts and for formal communication, informal communication is typically carried out in one of the regional dialects of Arabic. Dialectal variations influence the pronunciation dictionary, acoustic models and language models of an ASR system. Previous work on dialectal Arabic ASR includes cross-dialectal data sharing (1) and improved pronunciation and language modeling (2, 3). In this paper, we describe our experiments on a dialectal Arabic speech database, focusing on how different front-ends and the pronunciation dictionary behave under dialectal variations between speakers. We evaluate Mel-Frequency Cepstral Coefficients (MFCC) and Multi-Layer Perceptron (MLP) features on their ability to handle the variations that arise from different dialects.

Extending our previous work on gender normalization (4), we use phonetic decision trees as a diagnostic tool to analyze the influence of dialect on the clustered models. When building the decision tree, we introduce questions pertaining to dialect in addition to the usual context questions. We then build the tree to cluster the contexts and count the leaves that lie on branches with dialectal questions. The ratio of such 'dialectal' models to the total model size serves as a measure of dialect normalization: the higher the ratio, the more models are affected by dialect, hence the less normalization, and vice versa.

We further extend our analysis to the pronunciation dictionary, where we investigate ways to generate rule-based pronunciations for unseen words in a dialect with minimal manual effort. Our setup features a 'Pan-Arabic' dictionary, which contains pronunciations typically found in five Arabic dialects. We analyze the pronunciation variants in this common dictionary using acoustic model alignments to derive the dialect-specific pronunciations for each word. These form the input to our rule-learning algorithm, which maps word pronunciations from one dialect to another. The learned rules are then used to generate pronunciations for unseen words, and their accuracy is estimated.
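The dialectal-ratio diagnostic described above can be sketched as follows. This is a minimal illustration, not the authors' actual toolkit code: the tree representation, the question names, and the toy tree are all assumptions made for the example. A leaf counts as 'dialectal' if any question on its path from the root asks about dialect.

```python
# Minimal sketch of the 'dialectal ratio' diagnostic.
# Tree representation and question names are illustrative assumptions.

DIALECT_QUESTIONS = {"IS_EGYPTIAN", "IS_LEVANTINE", "IS_IRAQI", "IS_GULF", "IS_MSA"}

def count_leaves(node, under_dialect=False):
    """Return (total_leaves, dialectal_leaves) for a phonetic decision tree.

    A node is either a leaf (a clustered model, here just a string) or a
    dict {"question": q, "yes": subtree, "no": subtree}. A leaf is
    'dialectal' if any ancestor node asked a dialect question.
    """
    if not isinstance(node, dict):          # leaf: one clustered model
        return 1, int(under_dialect)
    under = under_dialect or node["question"] in DIALECT_QUESTIONS
    ty, dy = count_leaves(node["yes"], under)
    tn, dn = count_leaves(node["no"], under)
    return ty + tn, dy + dn

def dialectal_ratio(tree):
    total, dialectal = count_leaves(tree)
    return dialectal / total

# Toy tree: the root asks a context question; its 'yes' branch then
# splits on a dialect question, so two of the three leaves are dialectal.
tree = {
    "question": "LEFT_IS_VOWEL",
    "yes": {"question": "IS_EGYPTIAN", "yes": "model_A", "no": "model_B"},
    "no": "model_C",
}
print(dialectal_ratio(tree))  # 2 of 3 leaves sit under a dialect question
```

In this toy tree the ratio is 2/3, i.e. most clustered models are split by dialect, indicating poor dialect normalization under the measure defined above.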
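The cross-dialect pronunciation mapping can likewise be sketched in simplified form. The real rules are derived from acoustic-model alignments and must handle insertions and deletions; the sketch below only learns one-to-one phone substitutions from same-length pronunciation pairs, and the toy data (the /q/ to glottal-stop shift often described for Egyptian Arabic) is an illustrative assumption, not the paper's rule set.

```python
# Illustrative sketch: learn phone-substitution rules between two dialects'
# pronunciations and apply them to an unseen word. Only 1:1 substitutions
# from equal-length pairs are learned here; real alignments are richer.

from collections import Counter

def learn_rules(pron_pairs):
    """pron_pairs: list of (source_phones, target_phones) tuples.

    Returns a map from source phone to its most frequent target-dialect
    realization, skipping pairs that would need a real alignment.
    """
    counts = {}
    for src, tgt in pron_pairs:
        if len(src) != len(tgt):
            continue                      # needs insertion/deletion handling
        for s, t in zip(src, tgt):
            counts.setdefault(s, Counter())[t] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def apply_rules(phones, rules):
    """Rewrite a pronunciation phone by phone; unknown phones pass through."""
    return [rules.get(p, p) for p in phones]

# Toy data: /q/ realized as a glottal stop (written ' here) in the target dialect.
pairs = [
    (["q", "a", "l"], ["'", "a", "l"]),
    (["q", "a", "r", "b"], ["'", "a", "r", "b"]),
]
rules = learn_rules(pairs)
print(apply_rules(["q", "u", "l"], rules))  # /q/ is rewritten, /u/ and /l/ kept
```

Accuracy of such generated pronunciations would then be estimated against held-out dialect-specific pronunciations, as the abstract describes for the paper's full rule-learning algorithm.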
