Abstract
This paper describes a method of extracting time-varying features that is effective for speech signals with high fundamental frequencies. The proposed method adopts a speech production model that consists of a Time-Varying Auto-Regressive (TVAR) process for an articulatory filter and a Hidden Markov Model (HMM) for an excitation source. The model represents waveform amplitude variations by a time-varying gain of the excitation source. The proposed algorithm extends the Viterbi algorithm so that it can adaptively estimate the TVAR coefficients and the time-varying gain while decoding the state transitions of the excitation source HMM. We applied the proposed method to extracting time-varying features from both synthetic and natural speech, and confirmed its feasibility.

1. INTRODUCTION

The conventional Linear Prediction (LP) method is widely used to analyze speech signals [1]. However, several problems still remain to be solved [2]. One such problem is that local peaks of the LP spectral estimate are strongly biased toward the harmonics, especially for high-pitched speech. Several methods have been designed to overcome this problem [3, 4, 5, 6]. The authors have previously indicated that an analysis method based on a speech production model consisting of an Auto-Regressive (AR) process for an articulatory filter and a Hidden Markov Model (HMM) for an excitation source is robust for high fundamental frequencies [7, 8]. However, this method is not suitable for analyzing continuous speech for the following reasons. First, the AR coefficients and HMM are iteratively estimated within every analysis frame, so a large number of operations is needed. Second, the analysis frame size needs to be large in order to guarantee stable learning of the excitation source HMM. Third, the model parameters are assumed to be constant within the analysis frame, so the resulting parameters are averaged within such a long analysis frame.
This makes it difficult to extract the dynamic characteristics of speech when features change rapidly, as in a singing voice.

In this paper, we extend the speech production model in [7] so that the proposed model can represent time-varying features of continuous speech. We also describe an analysis method that adaptively estimates Time-Varying Auto-Regressive (TVAR) coefficients and gain based on the new model. The proposed method can substantially reduce the number of operations by applying the learned HMM and can also extract dynamic characteristics of continuous speech by estimating those time-varying features adaptively.

2. SPEECH PRODUCTION MODEL BASED ON TVAR-HMM

The proposed method adopts a speech production model that consists of a TVAR process for an articulatory filter and an HMM for an excitation source. The nodes of the HMM are concatenated in a ring in order to represent the periodicity of voiced sounds. We have previously shown that LP analysis incorporating the excitation source HMM can precisely estimate the characteristics of both the vocal tract and the excitation source from high-pitched speech signals [7, 8]. The proposed model represents the time-varying features of not only the vocal tract but also the waveform amplitude by multiplying an excitation source emitted from the HMM by a time-varying gain.
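
As an illustration of the generative side of such a model, the following is a minimal synthesis sketch and not the authors' implementation. This excerpt gives no equations, so everything specific here is an assumption: the recursion x[n] = sum_k a_k[n] x[n-k] + g[n] e[n] and its sign convention, Gaussian state emissions, a ring in which each state either stays or advances to its successor, and all names (`ring_transition`, `synthesize`, `gliding_resonator`, `decaying_gain`) and numeric values.

```python
import math
import random


def ring_transition(num_states, stay_prob):
    """Transition matrix of a ring HMM: state i stays or moves to (i + 1) % num_states."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        matrix[i][i] += stay_prob
        matrix[i][(i + 1) % num_states] += 1.0 - stay_prob
    return matrix


def synthesize(num_samples, transition, means, stds, coeff_fn, gain_fn, seed=0):
    """Generate samples from an assumed TVAR filter driven by a ring-HMM excitation.

    coeff_fn(n) returns the AR coefficients [a_1[n], ..., a_p[n]] at time n,
    gain_fn(n) returns the excitation gain g[n] (both hypothetical callbacks).
    """
    rng = random.Random(seed)
    order = len(coeff_fn(0))
    history = [0.0] * order  # past outputs, most recent first
    state = 0
    signal, states = [], []
    for n in range(num_samples):
        # Excitation sample emitted by the current HMM state (Gaussian, assumed).
        excitation = rng.gauss(means[state], stds[state])
        coeffs = coeff_fn(n)
        # Assumed recursion: x[n] = sum_k a_k[n] * x[n-k] + g[n] * e[n].
        sample = sum(a * x for a, x in zip(coeffs, history)) + gain_fn(n) * excitation
        signal.append(sample)
        states.append(state)
        history = [sample] + history[:-1]
        # Draw the next state from the current row of the transition matrix.
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
    return signal, states


# Illustrative values only: 8 kHz sampling and a 20-state ring with no
# self-loops, which gives one pulse every 20 samples (400 Hz).
SAMPLE_RATE = 8000.0
NUM_SAMPLES = 400
NUM_STATES = 20


def gliding_resonator(n):
    """Second-order resonance whose centre frequency glides from 500 Hz to 1500 Hz."""
    freq = 500.0 + 1000.0 * n / (NUM_SAMPLES - 1)
    theta = 2.0 * math.pi * freq / SAMPLE_RATE
    radius = 0.95
    return [2.0 * radius * math.cos(theta), -radius * radius]


def decaying_gain(n):
    """Gain that falls linearly from 1.0 to 0.5 over the signal."""
    return 1.0 - 0.5 * n / (NUM_SAMPLES - 1)


transition = ring_transition(NUM_STATES, stay_prob=0.0)
means = [1.0] + [0.0] * (NUM_STATES - 1)  # pulse-like excitation in state 0
stds = [0.05] * NUM_STATES
signal, states = synthesize(
    NUM_SAMPLES, transition, means, stds, gliding_resonator, decaying_gain
)
```

In this sketch the ring topology alone produces the periodicity: with `stay_prob=0.0` the state sequence cycles deterministically, and a nonzero `stay_prob` would let the pitch period vary. The time-varying gain scales only the excitation, so amplitude changes are kept separate from the filter coefficients, which is the separation the model description above relies on.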