Abstract

Feature extraction is a critical stage of digital speech processing systems: the quality of the features provides the foundation on which all subsequent stages stand. Distinctive phonetic features (DPFs) are among the most representative features of speech signals. Their significance lies in their ability to provide an abstract description of the places and manners of articulation of a language's phonemes. Each DPF element of a phoneme reflects unique articulatory information about that phoneme. There is therefore a need to investigate each DPF element individually, both to achieve a deeper understanding and to derive a descriptive model for each one; such fine-grained modeling respects the uniqueness of each DPF element. This paper tackles the problem of DPF modeling and extraction for Modern Standard Arabic (MSA). Deep neural networks (DNNs) initialized using deep belief networks (DBNs) have been remarkably successful in DSP applications and are capable of extracting highly representative features from raw data, so we exploit their modeling power to investigate and model the DPF elements. The DNN models are compared with classical multilayer perceptron (MLP) models, and the representativeness of several acoustic cues for different DPF elements is also measured. DPF modeling is formalized as a binary classification problem. Because the DPF elements are highly imbalanced, evaluating model quality is difficult, and this paper addresses evaluation measures suited to that imbalance. After each element is modeled individually, two top-level DPF extractors are designed, one MLP-based and one DNN-based. The results show the quality of the DNN models and their superiority over the MLPs, with accuracies of 89.0% and 86.7%, respectively.

Highlights

  • Feature extraction is an essential preprocessing stage of digital speech processing systems serving several applications such as automatic speech recognition (ASR), speaker identification, speech prosody analysis, and many others

  • This paper reports work on modeling the distinctive phonetic features (DPFs) of Modern Standard Arabic (MSA)

  • The study reported in [10] addressed DPF element extraction of American English using a multilayer perceptron (MLP) that was modeled using Deep Neural Networks (DNNs)

Summary

INTRODUCTION

Feature extraction is an essential preprocessing stage of digital speech processing systems, serving applications such as automatic speech recognition (ASR), speaker identification, speech prosody analysis, and many others. A DPF vector is a set of binary elements that uniquely describes the articulatory and phonetic properties of a phoneme [1]. For example, generating the phoneme /b/ involves vibration of the vocal folds, which is a brain activity that is described by setting the voicing element to "+". The ability of DPFs to describe the speech signal contextually and phonetically makes them highly advantageous for enhancing system performance and robustness [4]. Those benefits can be maximized if language-specific studies are conducted.
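
As a rough illustration of a DPF vector as a set of binary elements, the following Python sketch encodes a few phonemes with "+"/"-" values. The three feature names, the four-phoneme inventory, and the helper functions are assumptions made for this example only; they are not the MSA feature set or the models used in the paper.

```python
# Illustrative only: the feature names and this four-phoneme inventory are
# assumptions for the sketch, not the paper's MSA DPF set.
FEATURES = ("voiced", "nasal", "continuant")

# 1 stands for "+", 0 stands for "-".
DPF_TABLE = {
    "b": (1, 0, 0),  # voiced stop
    "p": (0, 0, 0),  # voiceless stop
    "m": (1, 1, 0),  # voiced nasal
    "s": (0, 0, 1),  # voiceless fricative
}


def dpf_string(phoneme):
    """Render a phoneme's DPF vector in +/- notation."""
    return " ".join(
        f"{'+' if bit else '-'}{name}"
        for name, bit in zip(FEATURES, DPF_TABLE[phoneme])
    )


def element_targets(phonemes, feature):
    """Binary targets for a single DPF element over a phoneme sequence."""
    idx = FEATURES.index(feature)
    return [DPF_TABLE[p][idx] for p in phonemes]


print(dpf_string("b"))
print(element_targets(["b", "p", "m", "s"], "nasal"))
```

Treating each element separately, as `element_targets` does, yields one binary labeling per DPF element. Even in this toy inventory the "nasal" element is positive for only one of the four phonemes, which hints at why class imbalance matters when such per-element models are evaluated.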

DPF EXTRACTION
BACKGROUND
PARAMETER FINE-TUNING
Short-time Energy
Binary Voicing
RESULTS AND DISCUSSION
CONCLUSIONS