Abstract

Voice cloning is a technique for building personalized text-to-speech applications for individual speakers. When only very limited training data is available, it is challenging to preserve both high speech quality and high speaker similarity. We propose a neural fusion architecture that incorporates a unit-concatenation method into a parametric text-to-speech model to address this issue. Unlike hybrid unit-concatenation systems, the proposed fusion architecture remains an end-to-end neural network model. It consists of a text encoder, an acoustic decoder, and a phoneme-level reference encoder. The reference encoder extracts phoneme-level embeddings from the cloning audio segments, while the text encoder infers phoneme-level embeddings from the input text. For each phoneme, one of the two embeddings is selected and sent to the decoder. We use autoregressive distribution modeling and decoder refinement after the selection stage to overcome the concatenation discontinuity problem. Experimental results show that the neural fusion system significantly improves speaker similarity when the selected units with the highest probability are used, while speech naturalness remains comparable to that of directly decoded systems.
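
The abstract gives no implementation details, but the overall data flow it describes (text encoder, phoneme-level reference encoder, per-phoneme embedding selection, decoder refinement) can be illustrated with a minimal sketch. The following PyTorch code is an assumption-laden toy: the module sizes, the GRU encoders, the class and parameter names (NeuralFusionTTS, d_model, threshold), and the cosine-similarity selection rule (a stand-in for the paper's probability-based unit selection) are all illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuralFusionTTS(nn.Module):
    """Toy sketch of the fusion architecture described in the abstract."""

    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        # Text encoder: infers a phoneme-level embedding from the input text.
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Reference encoder: extracts one embedding per cloning audio segment.
        self.ref_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        # Acoustic decoder: refines the selected embedding sequence into mel
        # frames, smoothing the discontinuities introduced by concatenating
        # units from different sources.
        self.decoder = nn.GRU(d_model, n_mels, batch_first=True)

    def forward(self, phoneme_ids, ref_mels=None, threshold=0.5):
        # text_emb: (batch, n_phones, d_model), inferred from text.
        text_emb, _ = self.text_encoder(self.phoneme_emb(phoneme_ids))
        fused = text_emb
        if ref_mels is not None:
            # ref_mels: (batch, n_phones, frames, n_mels), one cloning
            # audio segment per phoneme.
            b, p, t, d = ref_mels.shape
            _, h = self.ref_encoder(ref_mels.reshape(b * p, t, d))
            ref_emb = h[-1].reshape(b, p, -1)
            # Selection stage: per phoneme, keep the reference (unit)
            # embedding when it agrees with the text prediction; cosine
            # similarity here is only a proxy for the paper's
            # probability-based selection.
            sim = F.cosine_similarity(text_emb, ref_emb, dim=-1)
            fused = torch.where((sim > threshold).unsqueeze(-1),
                                ref_emb, text_emb)
        # Decoder refinement over the mixed embedding sequence.
        mel, _ = self.decoder(fused)
        return mel


model = NeuralFusionTTS()
ids = torch.randint(0, 100, (2, 12))   # 2 utterances, 12 phonemes each
refs = torch.randn(2, 12, 20, 80)      # 20 mel frames per phoneme segment
mel = model(ids, refs)                 # (2, 12, 80)
```

For simplicity this sketch emits one mel frame per phoneme; a real system would upsample to frame rate and condition each decoder step on previously generated frames, which is what the abstract's autoregressive distribution modeling refers to.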
