Abstract
Feature extraction is a critical stage of digital speech processing (DSP) systems. The quality of the extracted features is of great importance because every subsequent stage builds on them. Distinctive phonetic features (DPFs) are among the most representative features of speech signals. Their significance lies in their ability to provide an abstract description of the places and manners of articulation of a language's phonemes. Each DPF element of a phoneme reflects unique articulatory information about that phoneme. There is therefore a need to investigate each DPF element individually in order to gain a deeper understanding of it and to build a descriptive model for it. Such fine-grained modeling respects the uniqueness of each DPF element. In this paper, the problem of DPF modeling and extraction for Modern Standard Arabic (MSA) is tackled. Deep neural networks (DNNs) initialized using deep belief networks (DBNs) have been remarkably successful in DSP applications and are capable of extracting highly representative features from raw data, so we exploit their modeling power to investigate and model the DPF elements. The DNN models are compared with classical multilayer perceptron (MLP) models. The representativeness of several acoustic cues for different DPF elements is also measured. The paper formalizes DPF modeling as a binary classification problem. Because the DPF elements are highly imbalanced, evaluating the quality of the models is difficult, and the paper addresses evaluation measures suited to this imbalance. After each element is modeled individually, two top-level DPF extractors are designed: an MLP-based extractor and a DNN-based extractor. The results show the quality of the DNN models and their superiority over the MLPs, with accuracies of 89.0% and 86.7%, respectively.
Highlights
Feature extraction is an essential preprocessing stage of digital speech processing systems serving several applications such as automatic speech recognition (ASR), speaker identification, speech prosody analysis, and many others
This paper reports work on modeling the distinctive phonetic features (DPFs) of Modern Standard Arabic (MSA)
The study reported in [10] addressed DPF element extraction of American English using a multilayer perceptron (MLP) that was modeled using Deep Neural Networks (DNNs)
Summary
Feature extraction is an essential preprocessing stage of digital speech processing systems serving several applications such as automatic speech recognition (ASR), speaker identification, speech prosody analysis, and many others. A DPF vector is a set of binary elements that uniquely describes the articulatory and phonetic properties of phonemes [1]. For example, generating the phoneme /b/ involves vibration of the vocal folds, which is a brain activity that is described by setting the voicing element to “+”. The ability of DPFs to describe the speech signal contextually and phonetically makes them highly advantageous for enhancing system performance and robustness [4]. Those benefits can be maximized if language-specific studies are conducted.
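As a rough illustration of the idea of a DPF vector as a set of binary elements, the sketch below encodes a toy table and derives per-element binary targets from it. The element names (`voiced`, `nasal`), the four-phoneme inventory, and all function names are assumptions made for illustration only; they are not the DPF set, the data, or the evaluation measures used in the paper. Balanced accuracy is used here merely as one common measure for imbalanced binary data.

```python
from typing import Dict, List

# Hypothetical toy inventory: element names and phonemes are illustrative,
# not the DPF set used in the paper.
DPF_ELEMENTS: List[str] = ["voiced", "nasal"]

DPF_TABLE: Dict[str, Dict[str, bool]] = {
    "b": {"voiced": True, "nasal": False},   # /b/ is voiced ("+" voicing)
    "t": {"voiced": False, "nasal": False},
    "m": {"voiced": True, "nasal": True},
    "s": {"voiced": False, "nasal": False},
}


def dpf_vector(phoneme: str) -> List[int]:
    """Return the binary DPF vector of a phoneme (1 for "+", 0 for "-")."""
    return [int(DPF_TABLE[phoneme][element]) for element in DPF_ELEMENTS]


def binary_targets(phonemes: List[str], element: str) -> List[int]:
    """Turn a phoneme sequence into 0/1 targets for a single DPF element."""
    return [int(DPF_TABLE[p][element]) for p in phonemes]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of positions where prediction and target agree."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of the recalls of the classes that occur in y_true."""
    recalls = []
    for cls in (0, 1):
        total = sum(1 for t in y_true if t == cls)
        if total == 0:
            continue  # skip a class that never occurs in the targets
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    sequence = ["b", "t", "m", "s"]
    targets = binary_targets(sequence, "nasal")   # only /m/ is positive
    always_negative = [0] * len(targets)          # trivial baseline classifier
    print(dpf_vector("b"))
    print(accuracy(targets, always_negative))
    print(balanced_accuracy(targets, always_negative))
```

In this toy setting, a classifier that always predicts “-” for the rare element still scores well on plain accuracy while recovering none of the positive cases, which is why a per-element binary formulation on imbalanced data calls for measures that account for both classes.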