Abstract

This paper proposes a novel voice adaptation method suited to interactive applications such as games, where source and target data are unaligned. Conventional methods rely on probabilistic models or, more recently, deep neural networks. Most of these methods require multiple subjects to be trained in conjunction, which makes voice adaptation impractical for commercial applications. We propose a method that converts audible frequencies to the light spectrum in a simple RGB color format and compares not sound-signal similarity but likeness in color. The comparison is performed using multi-objective optimization, which treats raw and normalized frame colors as two separate objectives, representing audible and spectral structure respectively. The distances for these objectives are used to select an ideal output frame. Finally, prosodic information such as speech intensity is transferred from the measured input values onto the designated output frame. The method is evaluated using MOS, ABX, and performance benchmarks, and is implemented in the Unity3D game engine as a proof of concept. Results show good sound quality and high performance with little output fragmentation.
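
To make the pipeline concrete, the sketch below illustrates one plausible reading of the abstract in Python: an audio frame's spectrum is reduced to an RGB triple, and a candidate output frame is chosen by combining two color distances (raw color for audible structure, normalized color for spectral structure). The band-energy mapping, equal-weight scalarization, and names such as frame_to_rgb are assumptions for illustration; the paper's exact mapping and multi-objective selection scheme are not specified here.

    import numpy as np

    def frame_to_rgb(frame):
        """Map an audio frame's magnitude spectrum to an RGB triple.

        Hypothetical mapping: the spectrum is split into three bands
        (low/mid/high) whose summed energies become R, G, B.
        """
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        bands = np.array_split(spectrum, 3)           # low, mid, high bands
        return np.array([band.sum() for band in bands])  # raw (unnormalized) color

    def normalize(rgb, eps=1e-9):
        """Intensity-invariant color: keeps spectral shape, drops loudness."""
        return rgb / (rgb.sum() + eps)

    def select_output_frame(input_frame, dictionary):
        """Pick the target-speaker frame whose color best matches the input.

        Two objectives, per the abstract: distance between raw colors
        (audible structure) and between normalized colors (spectral
        structure). They are combined here by equal-weight scalarization,
        which is an assumption.
        """
        q_raw = frame_to_rgb(input_frame)
        q_norm = normalize(q_raw)
        best, best_cost = None, np.inf
        for idx, cand in enumerate(dictionary):
            c_raw = frame_to_rgb(cand)
            d_raw = np.linalg.norm(q_raw - c_raw)                 # objective 1
            d_norm = np.linalg.norm(q_norm - normalize(c_raw))    # objective 2
            cost = d_raw + d_norm
            if cost < best_cost:
                best, best_cost = idx, cost
        return best

The selected frame would then be rescaled (e.g., to the input frame's RMS energy) to transfer prosodic intensity, consistent with the final step described above.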
