On a Voice Conversion by using Prosodic Control

Hernsoo Hahn, Jongkuk Kim, Min Cheol Hong

doi:10.2991/icacsei.2013.117

Abstract

Voice conversion is a method that aims to transform an input speech signal so that the output signal is perceived as having been produced by another speaker. Speech synthesizers using voice conversion technologies allow developers to create more voices from a single database, and allow users to personalize the synthesizer to speak with any desired voice after a training period. In this paper, we present a method that performs time and pitch scaling using spectral mapping and the PSOLA technique with OLA. This new synthesis scheme allows very flexible modifications of the pitch scale, the time scale and the spectral envelope characteristics while producing high-quality speech output. This synthesis scheme is thus well suited to voice conversion. Further work will be conducted on a matching method that corresponds well to each piece of phonetic information, and on larger corpora to assess the robustness of the method.

Index Terms - PSOLA, Voice conversion, Prosodic, DTW, Mapping, Pitch, Modification

1. Speech Analysis

Voice conversion is a method that aims to transform the input (source) speech signal so that the output (transformed) signal is perceived as having been produced by another (target) speaker. Controlling the synthesized speech is one of the most important issues in extending the application fields of TTS [1] systems. In amusement and education applications especially, generating good-quality speech for multiple speakers is strongly required. Parameter control and parameter mapping have been the two major approaches to voice transformation for speech synthesis. A perfect voice transformation system should simulate the modifications of vocal-tract characteristics, prosody and glottal excitation. This task is clearly beyond the capability of current speech knowledge and technology. Simulations of changes in prosodic strategy are difficult to implement and are currently outside the scope of this study. We will mainly stress the modifications of the acoustic parameters.
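
To make the overlap-add (OLA) idea behind time-scale modification concrete, the following is a minimal sketch and not the authors' implementation. It uses a fixed frame hop, whereas PSOLA places its windows pitch-synchronously at pitch marks, which this sketch does not attempt. The function names `hann` and `ola_time_stretch`, the frame length and the Hann window are all assumptions made for illustration.

```python
import math


def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def ola_time_stretch(signal, rate, frame_len=256):
    """Stretch `signal` in time by `rate` (>1 means longer) using plain OLA.

    Frames are read every `analysis_hop` samples and written every
    `synthesis_hop` samples. The summed frames are divided by the summed
    windows so that the overall gain stays close to one.
    """
    synthesis_hop = frame_len // 2
    analysis_hop = synthesis_hop / rate
    window = hann(frame_len)

    n_frames = max(1, int((len(signal) - frame_len) / analysis_hop) + 1)
    out_len = (n_frames - 1) * synthesis_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len

    for k in range(n_frames):
        src = int(round(k * analysis_hop))  # read position in the input
        dst = k * synthesis_hop             # write position in the output
        for i in range(frame_len):
            if src + i >= len(signal):
                break
            out[dst + i] += window[i] * signal[src + i]
            norm[dst + i] += window[i]

    # Normalise by the accumulated window weight; leave silent gaps at zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

With `rate=1.0` the read and write positions coincide, so the covered part of the input is reproduced. With `rate=2.0` the same frames are spread over roughly twice the duration. Fixed-hop OLA of this kind can introduce phase discontinuities in voiced speech, which is the problem that pitch-synchronous window placement is meant to avoid.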
In particular, we will focus on a technique that simulates speaker transformation by mapping the acoustic space of one speaker onto the acoustic space of another. Speaker characteristics will be specified through training. Our method differs from previously proposed techniques in two major respects: first, we use the PSOLA synthesis framework [4]
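
Training a mapping between two speakers' acoustic spaces generally requires pairing source and target frames of the same utterance. The index terms list DTW, but this excerpt does not describe how it is applied, so the following is only a hedged sketch of a standard dynamic time warping alignment. The function name `dtw_align`, the Euclidean frame distance and the symmetric step pattern are assumptions.

```python
def dtw_align(source, target):
    """Align two sequences of feature vectors with basic DTW.

    Returns (total_cost, path), where path is a list of
    (source_index, target_index) pairs from start to end.
    """
    def dist(a, b):
        # Euclidean distance between two frames (assumed metric).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Accumulate the cheapest cost of reaching each (i, j) cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

The aligned index pairs could then serve as training pairs for a spectral mapping, for example by collecting source and target frames that fall on the same path step. How the paper actually builds its mapping is not stated in this excerpt.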
