Abstract

This work proposes a technique for predicting pitch from Mel-frequency cepstral coefficient (MFCC) vectors. Previous pitch prediction methods are based on statistical models such as Gaussian mixture models and hidden Markov models. In this paper, we propose a three-step method to estimate pitch from MFCC vectors. First, the Mel-filterbank energies (MFBEs) are estimated from the MFCC vectors. Second, we propose a novel method to estimate the spectrum from the MFBEs that exploits the sparse nature of the voiced speech spectrum. Finally, the pitch is estimated from the recovered spectrum. We also explore how different levels of truncation of the discrete cosine transform (DCT) coefficients in the MFCC computation affect the pitch prediction error. We use a deep neural network (DNN) based predictor as the baseline for predicting pitch from MFCC vectors. Experiments on the CMU-ARCTIC and KEELE databases show that the proposed three-step method generalizes better across databases and genders, resulting in a drop of approximately 8 Hz and 5 Hz in the average RMSE of the predicted pitch relative to the DNN when 13-dimensional and 26-dimensional MFCC vectors, respectively, are used for pitch prediction. We also find that the sparsity constraint performs better in recovering the spectrum at lower pitch values.
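To illustrate why DCT truncation matters for the first step, the following minimal sketch recovers log-MFBEs from MFCCs with a zero-padded inverse DCT. It is not the estimator used in the paper, which the abstract does not specify. The 26-filter bank, the orthonormal DCT-II, the zero-padding strategy, the synthetic input, and all function names are assumptions made for illustration only.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a sequence x."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ii(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def mfcc_from_log_mfbe(log_mfbe, n_keep):
    """Keep the first n_keep DCT coefficients (the truncation step)."""
    return dct_ii(log_mfbe)[:n_keep]

def log_mfbe_from_mfcc(mfcc, n_filters):
    """Assumed recovery: zero-pad the MFCC vector, then invert the DCT."""
    padded = list(mfcc) + [0.0] * (n_filters - len(mfcc))
    return idct_ii(padded)

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Synthetic log filterbank energies (hypothetical values, not speech data).
N_FILTERS = 26
log_mfbe = [math.log(1.0 + (i % 5) + 0.1 * i) for i in range(N_FILTERS)]

# 26-dimensional MFCCs: no truncation, so recovery is exact up to rounding.
err_full = rmse(log_mfbe, log_mfbe_from_mfcc(mfcc_from_log_mfbe(log_mfbe, 26), N_FILTERS))

# 13-dimensional MFCCs: the discarded coefficients cannot be recovered this way.
err_trunc = rmse(log_mfbe, log_mfbe_from_mfcc(mfcc_from_log_mfbe(log_mfbe, 13), N_FILTERS))

print(f"RMSE with 26 coefficients: {err_full:.2e}")
print(f"RMSE with 13 coefficients: {err_trunc:.2e}")
```

With all 26 coefficients the DCT is invertible and the log-MFBEs are recovered exactly. With 13, the fine structure across filters is lost, which is consistent with the abstract's interest in how truncation affects the pitch prediction error.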
