Abstract

This paper describes a method for distilling a recurrent-based sequence-to-sequence (S2S) voice conversion (VC) model. Although the performance of recent VCs is becoming higher quality, streaming conversion is still a challenge when considering practical applications. To achieve streaming VC, the conversion model needs a streamable structure, a causal layer rather than a non-causal layer. Motivated by this constraint and recent advances in S2S learning, we apply the teacher-student framework to recurrent-based S2S- VC models. A major challenge is how to minimize degradation due to the use of causal layers which masks future input information. Experimental evaluations show that except for male-to-female speaker conversion, our approach is able to maintain the teacher model's performance in terms of subjective evaluations despite the streamable student model structure. Audio samples can be accessed on http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/dists2svc.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.