Speech signal modeling using multivariate distributions

Ali Aroudi, Hadi Veisi, Zahra Mafakheri, Hossein Sameti

doi:10.1186/s13636-015-0078-1

Abstract

Using a proper distribution function for the speech signal or its representations is of crucial importance in statistical-based speech processing algorithms. Although the most commonly used probability density function (pdf) for speech signals is Gaussian, recent studies have shown the superiority of super-Gaussian pdfs. Considerable research effort has focused on the univariate case of the speech signal distribution; in this paper, however, we study the multivariate distributions of the speech signal and its representations using the conventional distribution functions, e.g., multivariate Gaussian and multivariate Laplace, and the copula-based multivariate distributions as candidates. The copula-based technique is a powerful method for modeling non-Gaussian multivariate distributions with non-linear inter-dimensional dependency. The level of similarity between the candidate pdfs and the real speech pdf in different domains is evaluated using the energy goodness-of-fit test. In our evaluations, the best-fitted distributions for speech signal vectors with different lengths in various domains are determined. A similar experiment is performed for different classes of English phonemes (fricatives, nasals, stops, vowels, and semivowels/glides). The evaluation results demonstrate that the multivariate distribution of speech signals in different domains is mostly super-Gaussian, except for Mel-frequency cepstral coefficients (MFCC). The results also confirm that the distribution of the different phoneme classes is better statistically modeled by a mixture of Gaussian and Laplace pdfs. The copula-based distributions provide better statistical modeling of vectors representing the discrete Fourier transform (DFT) amplitude of speech vectors with a length shorter than 500 ms.

Highlights

– Statistical-based speech processing algorithms have attracted wide interest during the last three decades in numerous applications, e.g., speech coding [1], speech recognition [2, 3], speech synthesis [4], and speech enhancement [5].
– The best-fitted candidate in the sense of the energy test for the time-domain (T), real part of DFT (RDFT), imaginary part of DFT (IDFT), and discrete cosine transform (DCT) features with frame lengths of 20, 30, and 100 ms is the multivariate Laplace distribution (MLD). This departs from the multivariate Gaussian assumption often used in speech enhancement algorithms [8, 9, 10] but is consistent with the univariate Laplace distribution proposed by Martin [6] and Gazor et al. [14].
– The univariate Rayleigh distribution has been proposed for the amplitude of DFT (ADFT) feature with a short frame length.
– The change of the best-fitted distribution for linear predictive coefficient (LPC) features from MGLD to MGD verifies this contribution, too.
– The best-fitted candidate for the Mel-frequency cepstral coefficient (MFCC) features with different frame lengths is the multivariate Gaussian distribution (MGD), consistent with the multivariate Gaussian assumption used in most speech recognition algorithms [2, 3].

Summary

Introduction

Statistical-based speech processing algorithms have attracted wide interest during the last three decades in numerous applications, e.g., speech coding [1], speech recognition [2, 3], speech synthesis [4], and speech enhancement [5]. Studying and modeling speech signals in the multivariate case typically raises several challenges, e.g., the linear or non-linear inter-dimensional dependency, and the sparsity and complexity of the multidimensional space. Traditional hidden Markov model (HMM)-based speech recognition and synthesis algorithms [3, 27] exploit Mel-frequency cepstral coefficients (MFCC); HMM-based speaker recognition systems [13] exploit either linear predictive coding (LPC) or MFCC; HMM-based speech enhancement algorithms use LPC, time, DCT, MFCC, or DFT features [7, 9, 10]; and codebook-driven speech enhancement algorithms [28] employ LPC. All these algorithms assume the multivariate Gaussian pdf for the extracted features of speech signals. The purpose of this section is to briefly review the basic definition of the copula and a number of the most commonly used estimation methods for fitting the copula to real data.

Copula model

Gaussian copulas
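
As a rough illustration of how a Gaussian copula can be fitted to data, the following Python sketch uses a standard rank-based approach, which is not necessarily the estimation method used in this paper. Each dimension is mapped to pseudo-observations in (0, 1) through its ranks, the pseudo-observations are converted to normal scores, and the correlation matrix of those scores serves as the copula parameter. The function names, the synthetic data, and the dependence value 0.7 are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist


def pseudo_observations(column):
    """Map one dimension to (0, 1) using ranks / (n + 1)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def fit_gaussian_copula(data):
    """Estimate the Gaussian copula correlation matrix from rows of data.

    Assumed estimator: Pearson correlation of the normal scores of the
    rank-based pseudo-observations.
    """
    std_normal = NormalDist()
    n = len(data)
    columns = list(zip(*data))
    scores = [
        [std_normal.inv_cdf(u) for u in pseudo_observations(col)]
        for col in columns
    ]

    def corr(a, b):
        mean_a = sum(a) / n
        mean_b = sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        return cov / math.sqrt(var_a * var_b)

    d = len(scores)
    return [[corr(scores[i], scores[j]) for j in range(d)] for i in range(d)]


# Synthetic example (hypothetical data): Gaussian dependence with rho = 0.7,
# then strictly increasing transforms that make the marginals non-Gaussian
# while leaving the copula unchanged.
random.seed(0)
rho = 0.7
data = []
for _ in range(2000):
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    data.append((math.exp(z1), z2 ** 3))

R = fit_gaussian_copula(data)
print("estimated copula correlation:", R[0][1])
```

Because only ranks enter the estimate, the marginal transforms do not affect it, which is the property that lets a copula separate the dependency structure from the marginal pdfs.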

Student-t copulas

Fit a copula model

Energy test
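
The energy test compares distributions through pairwise Euclidean distances. As a hedged sketch (a two-sample, sample-based approximation, and not necessarily the exact goodness-of-fit statistic used in this paper), the energy distance between observed vectors X and vectors Y drawn from a candidate pdf can be computed as 2E|X - Y| - E|X - X'| - E|Y - Y'|, where a smaller value suggests a closer fit. The function names and the synthetic two-dimensional data below are assumptions for illustration; in particular, vectors with independent Laplace components are a simplification and not the multivariate Laplace distribution itself.

```python
import math
import random


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y)."""
    total = 0.0
    for x in xs:
        for y in ys:
            total += math.dist(x, y)
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Sample energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (
        2.0 * mean_pairwise_distance(xs, ys)
        - mean_pairwise_distance(xs, xs)
        - mean_pairwise_distance(ys, ys)
    )


def laplace_vector(dim):
    """Vector with independent unit-variance Laplace components (assumed toy model)."""
    scale = 1.0 / math.sqrt(2.0)  # variance of Laplace(scale) is 2 * scale**2
    return tuple(
        scale * random.expovariate(1.0) * random.choice((-1.0, 1.0))
        for _ in range(dim)
    )


def gaussian_vector(dim):
    """Vector with independent standard Gaussian components."""
    return tuple(random.gauss(0.0, 1.0) for _ in range(dim))


# Hypothetical "observed" data and two candidate models of equal variance.
random.seed(1)
observed = [laplace_vector(2) for _ in range(200)]
candidates = {
    "Gaussian": [gaussian_vector(2) for _ in range(200)],
    "Laplace": [laplace_vector(2) for _ in range(200)],
}
for name, sample in candidates.items():
    print(name, energy_distance(observed, sample))
```

With samples this small the ranking of close candidates can be noisy; a practical test would use more data and a reference distribution for the statistic, for example obtained by resampling.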

B: Second best-candidate

Conclusions

Additional file

Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Dec 1, 2015
Citations: 40
License type: CC BY 4.0
