Abstract

This paper proposes a deep neural network (DNN)-based method for Mandarin-Tibetan cross-lingual speech synthesis that uses speaker adaptive training. Initials and finals serve as the speech synthesis units for both Mandarin and Tibetan. With these units, a DNN-based average voice model (AVM) is trained from a large multi-speaker Mandarin corpus and a small single-speaker Tibetan corpus. Speaker adaptation is then applied to the AVM to train speaker-dependent DNN models for Mandarin or Tibetan, and Mandarin or Tibetan speech is synthesized with the corresponding speaker-dependent DNN acoustic model. Both subjective evaluations and objective tests show that Tibetan speech synthesized by the proposed method is better than that of the traditional hidden Markov model (HMM)-based method, and also better than that of DNN-based Tibetan speech synthesis trained only on the Tibetan corpus. Mixing Tibetan speech into the training data has little effect on the quality of the synthesized Mandarin speech. The proposed method can therefore be applied to speech synthesis for minority languages with scarce speech resources.
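The two-stage training described in the abstract (an average model trained on pooled data, followed by adaptation to one speaker) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a one-dimensional linear model stands in for the DNN, the synthetic "speakers" are hypothetical, and adaptation is implemented as plain fine-tuning, which the abstract does not specify.

```python
import random

def train(w, b, data, lr=0.05, epochs=200):
    """Gradient descent on mean squared error for the toy model y = w*x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, data):
    """Mean squared error of the toy model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def make_speaker(slope, offset, n, rng):
    """Hypothetical speaker: a noisy linear mapping with its own slope and offset."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, slope * x + offset + rng.gauss(0, 0.05)) for x in xs]

rng = random.Random(0)

# Stage 0: build a pooled corpus (assumed sizes, not the paper's data).
# Three large "speakers" play the role of the multi-speaker corpus;
# one small "speaker" plays the role of the low-resource target.
pooled = []
for slope, offset in [(1.0, 0.0), (1.2, 0.1), (0.8, -0.1)]:
    pooled += make_speaker(slope, offset, 200, rng)
target_train = make_speaker(1.1, 0.5, 10, rng)
target_test = make_speaker(1.1, 0.5, 100, rng)
pooled += target_train  # the small corpus is mixed into the pooled data

# Stage 1: train the average model on the pooled data.
avm_w, avm_b = train(0.0, 0.0, pooled)

# Stage 2: adapt to the target speaker, starting from the average model.
adapted_w, adapted_b = train(avm_w, avm_b, target_train)

avm_error = mse(avm_w, avm_b, target_test)
adapted_error = mse(adapted_w, adapted_b, target_test)
print("average model error on target speaker:", avm_error)
print("adapted model error on target speaker:", adapted_error)
```

In this toy setting the adapted model fits the target speaker more closely than the average model does, because adaptation starts from parameters learned on the pooled data and then moves them toward the target speaker using only a handful of samples.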
