Abstract

This paper proposes an improved cross-lingual speaker adaptation technique that takes into account the differences between language-dependent average voices in a Speech-to-Speech Translation system. A state-mapping-based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this method, transforms estimated in the input language are applied to the average voice models of the output language according to the state mapping information. However, the differences between the average voices of the input and output languages may degrade adaptation performance. To reduce these differences, we apply a global linear transform to the output average voice models that minimizes the symmetric Kullback-Leibler divergence between the two average voice models. Experimental results show that our approach does not outperform the original state-mapping-based method. This is because the global transform affects not only speaker characteristics but also language identity in the acoustic features, which degrades synthetic speech quality. It therefore becomes clear that a technique that separates speaker and language identities is required.
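To make the adaptation criterion mentioned above concrete, the following minimal sketch computes the symmetric Kullback-Leibler divergence between two diagonal-covariance Gaussian state distributions, as used in HMM-based speech synthesis, and indicates where a global affine transform of the output-language average voice would enter the criterion. The function name `symmetric_kl_diag`, the transform variables `A` and `b`, and the example dimensions are illustrative assumptions and do not come from the paper; the actual estimation of the transform is not reproduced here.

```python
import numpy as np

def symmetric_kl_diag(mu1, var1, mu2, var2):
    """Symmetric Kullback-Leibler divergence between two diagonal-covariance
    Gaussians N(mu1, diag(var1)) and N(mu2, diag(var2)).

    KL(p||q) = 0.5 * sum(var1/var2 + (mu2 - mu1)^2/var2 - 1 + log(var2/var1));
    the symmetric form is KL(p||q) + KL(q||p).
    """
    kl_pq = 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1.0
                         + np.log(var2 / var1))
    kl_qp = 0.5 * np.sum(var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1.0
                         + np.log(var1 / var2))
    return kl_pq + kl_qp

# Hypothetical usage: one mapped state pair from the input- and
# output-language average voice models (random values for illustration).
rng = np.random.default_rng(0)
dim = 40                                   # e.g. spectral feature dimension
mu_in, var_in = rng.normal(size=dim), np.full(dim, 1.0)
mu_out, var_out = rng.normal(size=dim), np.full(dim, 1.2)

# A global affine transform (A, b) applied to the output-language means;
# the criterion is to choose the transform that minimizes the total
# symmetric KLD accumulated over all mapped state pairs.
A, b = np.eye(dim), np.zeros(dim)
mu_out_adapted = A @ mu_out + b
print(symmetric_kl_diag(mu_in, var_in, mu_out_adapted, var_out))
```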
