Prosody Modification of Standard Arabic Speech Using Combining Synchronous Overlap and Add With Fixed-Synthesis Algorithm and Multi Level Discrete Wavelet Transform

Abdel Abdel

doi:10.3844/jcssp.2010.392.405

Abstract

Problem statement: The objective of prosody modification is to change the amplitude, duration and pitch (F0) of speech segments without altering their spectral envelope. Applications are numerous and include Text-To-Speech synthesis, transformation of voice characteristics and foreign language learning. Several approaches have been developed in the literature to achieve this goal. Their main limitations are the modification range, the quality of the synthesized speech and the naturalness of the spoken language. Recent studies provide evidence that the first formant (F1) and F0 are dependent, suggesting that, to preserve the quality and naturalness of the speech signal, any change to one of these parameters must be accompanied by a suitable modification of the other.

Approach: This study introduced a prosody modification method that combines the Synchronous Overlap and Add with Fixed-Synthesis (SOLAFS) algorithm with a multi-level decomposition based on the Discrete Wavelet Transform (DWT) to overcome the limitations cited above. It used Standard Arabic (SA) sounds. For comparison, two techniques based on frame-by-frame processing were proposed. The first consists of pitch-synchronous processing of the mth approximation level time segments used in the SOLAFS algorithm. It aimed to modify the prosody of the input speech without affecting the spectral envelope. The second explores the correlation between F1 and F0 in the corresponding approximation level of SA sounds and modifies the duration and both the F0 and F1 scales. It was based on a re-sampling method using FFT interpolation. The multi-level analysis was intended to provide independent control over the spectral envelope. In both techniques, the decomposition level depends on the chosen sampling frequency (FS). F0 marking was based on multi-level peak comparison. Both techniques use an automatic speech classification algorithm based on a modified version of the Johnson algorithm.

Results: The performance of the proposed techniques was evaluated by listening tests using SA sentences sampled at an FS of 16 kHz. Manipulating F0 in conjunction with the local F1 in the third approximation level significantly improved the naturalness of the modified speech compared with classical prosody modification.

Conclusion: This improvement was most suitable for high F0 scales, because speakers generally increase F1 as they increase their F0. Further, the technique can be used to manipulate the remaining formant structure.
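The abstract notes that the decomposition level depends on the sampling frequency. As a rough illustration only, the sketch below computes the nominal frequency band of a level-m DWT approximation, which for an ideal dyadic decomposition is 0 to FS / 2^(m+1), and extracts approximation coefficients with a Haar wavelet. The Haar wavelet and the function names `approximation_band_hz` and `haar_approximation` are assumptions made for this example; the abstract does not name the wavelet used in the study.

```python
import math

def approximation_band_hz(fs_hz, level):
    """Nominal (low, high) band of the level-`level` DWT approximation.

    Each decomposition stage halves the bandwidth, so an ideal dyadic
    DWT leaves 0 .. fs / 2**(level + 1) in the approximation.
    """
    return (0.0, fs_hz / 2 ** (level + 1))

def haar_approximation(signal, level):
    """Approximation coefficients after `level` Haar DWT stages.

    The Haar wavelet is an assumption chosen for brevity.
    """
    approx = list(signal)
    for _ in range(level):
        if len(approx) % 2:
            approx.append(approx[-1])  # pad odd lengths by repeating the last sample
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2)
                  for i in range(0, len(approx), 2)]
    return approx

# For speech sampled at 16 kHz, the third approximation level
# nominally spans 0 to 1000 Hz.
low_hz, high_hz = approximation_band_hz(16000, 3)
```

Under these assumptions, each additional level halves both the bandwidth and the number of coefficients, so a higher sampling frequency needs more levels to reach the same low-frequency band.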

Highlights

The purpose of prosody modification is to change the amplitude, duration and pitch (F0) of a speech segment without affecting the timbre of the speaker's voice
Speech intelligibility is related to the number of speech items that are recognized correctly, while speech distortion is related to the quality of a reproduced speech signal with respect to the amount of audible distortion
For the AL3PR-Synchronous Overlap and Add with Fixed-Synthesis (SOLAFS) technique, high pitch period modification factors lead to an acceptable degree of naturalness, although a metallic effect becomes audible as the modification factors increase

Summary

Introduction

The purpose of prosody modification is to change the amplitude, duration and pitch (F0) of a speech segment without affecting the timbre of the speaker's voice. Amplitude modification can be accomplished by direct multiplication, but duration and F0 changes are not so straightforward (Ykhlef et al., 2008). Applications include Text-To-Speech synthesis (TTS), transformation of voice characteristics, foreign language learning and audio monitoring or film/soundtrack post-synchronization (Moulines and Laroche, 1995). In a TTS system, it is necessary to modify the durations and F0 contours of the basic units in order to incorporate the relevant suprasegmental knowledge in the utterance corresponding to the sequence of these units (Yegnanarayana et al., 1994).

Corresponding Author: Ykhlef Faycal, Multimedia and System Architecture, Center of Development of Advanced Technologies, Algeria
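The contrast drawn in the introduction between amplitude scaling and duration or F0 changes can be illustrated with a small sketch. It scales a synthetic 100 Hz tone by a constant gain, then naively resamples it by a factor of 2 using linear interpolation. The tone, the helper names and the resampling approach are assumptions made for illustration; this is not the SOLAFS or DWT method described in the study.

```python
import math

def scale_amplitude(signal, gain):
    """Amplitude modification: multiply every sample by a constant."""
    return [gain * s for s in signal]

def naive_resample(signal, factor):
    """Read the signal `factor` times faster using linear interpolation."""
    n_out = int(len(signal) / factor)
    out = []
    for k in range(n_out):
        pos = k * factor
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1 - frac) * signal[i] + frac * nxt)
    return out

def count_rising_zero_crossings(signal):
    """Count negative-to-non-negative transitions (about one per cycle)."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a < 0 <= b)

fs = 16000   # sampling frequency in Hz (assumed for this example)
f0 = 100.0   # synthetic pitch in Hz (assumed for this example)
tone = [math.sin(2 * math.pi * f0 * n / fs) for n in range(fs)]  # 1 second

louder = scale_amplitude(tone, 2.0)   # same duration, same pitch
faster = naive_resample(tone, 2.0)    # half the samples, same number of cycles
```

Played back at the same sampling frequency, `faster` lasts half as long but contains the same number of cycles as `tone`, so its pitch doubles. Naive resampling therefore couples duration and F0, which is why independent control of each needs a dedicated technique.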

Methods

Results

Discussion

Conclusion