Abstract

The paper presents a speech generative model that provides an efficient way of generating a speech waveform from its amplitude spectral envelopes. The model is based on a hybrid speech representation that includes deterministic (harmonic) and stochastic (noise) components. The main idea behind the approach is that a speech signal has a definite spectral structure that is statistically related to the distribution of deterministic and stochastic energy across the spectrum. The performance of the model is evaluated using an experimental low-bitrate wide-band speech coder. The quality of the reconstructed speech is evaluated using objective and subjective methods. Two objective quality measures were calculated: Modified Bark Spectral Distortion (MBSD) and Perceptual Evaluation of Speech Quality (PESQ). Narrow-band and wide-band versions of the proposed solution were compared with the MELP (Mixed Excitation Linear Prediction) speech coder and the AMR (Adaptive Multi-Rate) speech coder, respectively. A speech database of two female and two male speakers was used for testing. The tests show that the overall performance of the proposed approach is speaker-dependent and is better for male voices. This difference presumably indicates that pitch height influences the accuracy of the separation. Used in the experimental speech compression system, the proposed approach provides decent MBSD values and PESQ values comparable with those of the AMR speech coder at 6.6 kbit/s. Additional subjective listening tests demonstrate that the implemented coding system retains the phonetic content and the speaker's identity, which supports the consistency of the proposed approach.

Highlights

  • Contemporary speech synthesis algorithms have made a great leap forward thanks to the development of artificial neural networks

  • The separation function is estimated through a training procedure that fits data obtained through instantaneous harmonic analysis and the short-time spectrum

  • The model involves deterministic/stochastic decomposition that is carried out using a separation function, without conventional harmonic analysis


Summary

Introduction

Contemporary speech synthesis algorithms have made a great leap forward thanks to the development of artificial neural networks. The paper by Taha M., Azarov E.S., Likhachov D.S., and Petrovsky A.A., "An efficient speech generative model based on deterministic/stochastic separation of spectral envelopes", describes an algorithm that utilizes the Harmonic plus Noise Model (HNM) and statistical deterministic/stochastic separation of the envelopes. The separation function is estimated through a training procedure that fits data obtained through instantaneous harmonic analysis and the short-time spectrum.
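The idea of splitting a spectral envelope into deterministic and stochastic parts can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`split_envelope`, `interpolate`, `synthesize_frame`) is hypothetical. It assumes that the separation function gives, for each frequency point, the fraction of energy that is deterministic, that harmonics are plain cosines at multiples of the pitch frequency `f0`, and that the stochastic part can be crudely approximated by white Gaussian noise with a single gain rather than spectrally shaped noise.

```python
import math
import random


def split_envelope(envelope, separation):
    """Split envelope amplitudes into harmonic and noise amplitudes.

    Assumption: separation[k] in [0, 1] is the deterministic share of the
    energy at point k, so harmonic**2 + noise**2 == envelope**2.
    """
    harmonic = [a * math.sqrt(s) for a, s in zip(envelope, separation)]
    noise = [a * math.sqrt(1.0 - s) for a, s in zip(envelope, separation)]
    return harmonic, noise


def interpolate(values, freqs, f):
    """Linearly interpolate values given on ascending freqs, clamped at the ends."""
    if f <= freqs[0]:
        return values[0]
    for i in range(1, len(freqs)):
        if f <= freqs[i]:
            t = (f - freqs[i - 1]) / (freqs[i] - freqs[i - 1])
            return values[i - 1] + t * (values[i] - values[i - 1])
    return values[-1]


def synthesize_frame(envelope, separation, freqs, f0, fs, n_samples, seed=0):
    """Generate one frame as a sum of harmonics plus scaled white noise."""
    harmonic, noise = split_envelope(envelope, separation)
    rng = random.Random(seed)
    frame = [0.0] * n_samples

    # Deterministic part: cosines at multiples of f0 below the Nyquist frequency.
    k = 1
    while k * f0 < fs / 2:
        amp = interpolate(harmonic, freqs, k * f0)
        w = 2.0 * math.pi * k * f0 / fs
        for n in range(n_samples):
            frame[n] += amp * math.cos(w * n)
        k += 1

    # Stochastic part: white noise scaled by the mean noise amplitude
    # (a crude stand-in for noise shaped by the stochastic envelope).
    noise_gain = sum(noise) / len(noise)
    for n in range(n_samples):
        frame[n] += noise_gain * rng.gauss(0.0, 1.0)
    return frame
```

With a separation value of 1.0 at every point the frame is purely harmonic, and with 0.0 it is purely noise; intermediate values mix the two while keeping the total energy at each envelope point unchanged.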

Results
Conclusion
