Abstract

HMM-based speech synthesis, in which speech parameters are generated directly from HMM models, is a relatively new technique compared with other speech synthesis approaches. In this paper, we propose several modifications to the basic system to improve its quality. We apply a multi-band excitation model and use samples extracted from the spectral envelope as spectral parameters. In synthesis, the voiced and unvoiced speech parts are mixed according to per-band voicing parameters, and the voiced part is generated with a harmonic sinusoidal model. Experimental tests performed on an Arabic dataset show that the applied modifications improve the quality.

Index Terms: speech synthesis, HMM, MBE.

1. Introduction

HMM-based speech synthesis [1] is a relatively new technique compared with other synthesis techniques, and it seems promising. In this technique, HMM models simultaneously model different speech parameters; to synthesize speech, parameters are generated from these HMM models according to the input text, and speech is then synthesized from these parameters. The basic system is similar to the system described in [2]. It was found to have some problems in the quality of the synthesized speech. This degraded quality comes from using features similar to those used in speech recognition. In speech recognition it is desirable to discard the small details that differentiate one speaker from another and keep only the minimal information that discriminates between phonemes. The case is different in speech synthesis, where the goal is to generate synthesized speech carrying the full details of the original voice. A hard per-frame decision on whether a frame is voiced or unvoiced is another limiting factor of the system, since some phonemes are of mixed excitation type, and for most phonemes some noise is present in the speech signal. Modeling the excitation as either a pulse train with no noise or white noise with no periodicity is therefore not suitable.
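The hard voiced/unvoiced decision criticized above can be illustrated with a minimal numpy sketch of conventional per-frame excitation generation; the function name and parameters are ours for illustration, not the paper's implementation:

```python
import numpy as np

def frame_excitation(voiced, f0, frame_len, fs=16000):
    """Hard-decision excitation, as in the basic system being modified:
    an impulse train for voiced frames, white noise for unvoiced frames.
    Names and defaults are illustrative, not from the paper."""
    if voiced:
        period = int(round(fs / f0))   # samples per pitch period
        exc = np.zeros(frame_len)
        exc[::period] = 1.0            # one impulse per pitch period
        return exc
    return np.random.randn(frame_len)  # zero-mean white Gaussian noise
```

With this scheme every frame is either fully periodic or fully noisy, which is exactly the limitation that motivates the mixed-excitation approach described next.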
It is therefore more suitable to represent the excitation not by a single voicing parameter but with a mixed-excitation technique. In our proposed approach, we use HMMs to model a more detailed set of speech parameters in order to increase the output speech quality. The parameters include a voicing flag for each band, following the MBE (Multi-Band Excitation) technique [3], in which the frame bandwidth is divided into a number of sub-bands and each band is marked as either voiced or unvoiced. Also, instead of using mel-cepstral coefficients to represent the spectral envelope, we use a fixed number of spectral envelope samples, which are modeled directly in the HMMs. In synthesis, the voiced and unvoiced speech parts are mixed according to the band voicing parameters: the voiced part is generated with a harmonic sinusoidal model [4][5][6], while the unvoiced part is generated as filtered white noise. We applied the modified HMM-based speech synthesis system to the Arabic language. Arabic poses the problem of diacritization: the input text must be diacritized before it can be converted into a phoneme sequence. The Arabic language analysis and diacritization are based on RDI® language analysis tools for Arabic and are out of the scope of this paper. In the following sections, the basic HMM-based speech synthesis system is presented in Section 2, the modified system is described in Section 3, results are presented in Section 4, and the conclusion is presented in Section 5.

Figure 1.
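A minimal numpy sketch of how such frame-level mixed-excitation synthesis could work: harmonics falling in voiced bands are summed as sinusoids, and unvoiced bands are filled with spectrally masked white noise. All names, and the FFT-masking approach to noise shaping, are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def synthesize_frame(f0, amps, band_voicing, band_edges, frame_len, fs=16000):
    """Mixed-excitation frame synthesis sketch (hypothetical API).
    band_edges[i] is the upper frequency edge (Hz) of band i, so
    len(band_edges) == len(band_voicing) and band_edges[-1] <= fs / 2."""
    t = np.arange(frame_len) / fs
    # Voiced part: sum of harmonics that fall inside voiced bands
    voiced = np.zeros(frame_len)
    for k, a in enumerate(amps, start=1):
        fk = k * f0
        b = np.searchsorted(band_edges, fk)  # band index of this harmonic
        if b < len(band_voicing) and band_voicing[b]:
            voiced += a * np.cos(2 * np.pi * fk * t)
    # Unvoiced part: white noise kept only in the unvoiced bands (FFT mask)
    spec = np.fft.rfft(np.random.randn(frame_len))
    freqs = np.fft.rfftfreq(frame_len, 1 / fs)
    mask = np.zeros_like(freqs)
    for b, is_voiced in enumerate(band_voicing):
        lo = band_edges[b - 1] if b > 0 else 0.0
        if not is_voiced:
            mask[(freqs >= lo) & (freqs < band_edges[b])] = 1.0
    unvoiced = np.fft.irfft(spec * mask, n=frame_len)
    return voiced + unvoiced
```

In a full system the frames would additionally be overlap-added and the harmonic amplitudes taken from the generated spectral envelope samples; this sketch only shows the per-band mixing idea.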
