Abstract

A voice conversion system is expected to capture both the static and dynamic characteristics of the speech signal. Conventional approaches such as Mel frequency cepstrum coefficients and linear predictive coefficients focus on spectral features limited to the lower frequency bands. This paper presents a novel wavelet packet filter bank approach to identify the non-uniformly distributed dynamic characteristics of the speaker. The contribution of this paper is threefold. First, in the feature extraction stage, the dyadic wavelet packet tree structure is optimized to reduce computation while preserving the speaker-specific features. Second, in the feature representation step, magnitude and phase attributes are treated separately, since the raw time-frequency coefficients are highly correlated yet carry intelligible speech information. Finally, an RBF mapping function is established to transform the speaker-specific features from the source speaker to the target speaker. The results obtained by the proposed filter bank-based voice conversion system are compared with the baseline multiscale voice morphing results using subjective and objective measures. Evaluation results reveal that the proposed method outperforms the baseline by incorporating the speaker-specific dynamic characteristics and phase information of the speech signal.
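The dyadic wavelet packet decomposition named in the first contribution can be sketched as follows. This is a minimal illustration using PyWavelets with an assumed `db4` wavelet and a uniform level-3 tree; the paper itself prunes the tree non-uniformly to reduce computation, which is not reproduced here.

```python
import numpy as np
import pywt


def wp_subbands(frame, wavelet="db4", level=3):
    """Decompose one speech frame into 2**level wavelet-packet subbands
    (a uniform filter bank; the paper's optimized tree is non-uniform)."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    # Leaf nodes at the deepest level, ordered from low to high frequency.
    return {node.path: node.data
            for node in wp.get_level(level, order="freq")}


frame = np.random.randn(256)   # stands in for one pre-processed speech frame
bands = wp_subbands(frame)
print(len(bands))              # → 8 subbands at level 3
```

Each subband can then be processed independently, which is what allows speaker-specific frequency regions to be decomposed more finely than others.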

Highlights

  • The voice conversion (VC) system aims to apply various modifications to the source speaker’s voice so that the converted signal sounds like a particular target speaker’s voice [1,2]

  • In order to extract the speaker-specific features, several speech feature representations have been developed in the literature, such as Formant Frequencies (FF) [1,4], Linear Predictive Coefficients (LPC) [1,5], Line Spectral Frequencies (LSF) [6-8], Mel Frequency Cepstrum Coefficients (MFCC) [9], Mel Generalized Cepstrum (MGC) [10], and spectral lines [11]

  • Conclusion: In this article, a new feature extraction approach based on the admissible wavelet packet transform has been proposed



Introduction

The voice conversion (VC) system aims to apply various modifications to the source speaker’s voice so that the converted signal sounds like a particular target speaker’s voice [1,2]. In the training stage, a dyadic wavelet filter bank applied to the source and target speech frames partitions each frame into different frequency bands. The frequency bands carrying speaker-specific features are further decomposed to obtain finer resolution than the Mel filter bank [24,25]. The aligned magnitude and phase feature vectors of the source and target speakers are then used to train separate RBF-based transformation models that establish the conversion rules. At test time, the utterances of the source speaker are pre-processed in the same way as in the training stage to obtain separate feature vectors for the magnitude and phase of the filtered coefficients, and the transformation phase applies the RBF-based mapping rules developed during training to obtain the morphed features of the target speaker [27].
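The RBF-based mapping described above can be sketched as a Gaussian RBF network fitted by least squares on aligned source and target feature vectors. The kernel width, the choice of training points as centres, and the toy features below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np


def train_rbf(X_src, X_tgt, centers=None, sigma=1.0):
    """Fit a Gaussian RBF network mapping aligned source feature vectors
    X_src (N x d) to target feature vectors X_tgt (N x d).
    A minimal least-squares sketch, not the paper's exact training scheme."""
    C = X_src if centers is None else centers            # RBF centres
    def phi(X):
        # Pairwise squared distances to the centres, then Gaussian kernel.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    W, *_ = np.linalg.lstsq(phi(X_src), X_tgt, rcond=None)
    return lambda X: phi(X) @ W                          # conversion rule


# Toy example: learn a constant shift between 2-D "features".
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 2))
tgt = src + 0.5
convert = train_rbf(src, tgt)
```

In the system described above, one such model would be trained per feature stream, i.e. separately for the magnitude and phase vectors of each subband.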

