Abstract
Voice adaptation is an interactive speech processing technique that allows the speaker to transmit speech in a chosen target voice. We propose a novel method intended for dynamic scenarios, such as online video games, where the source speaker's and target speaker's data are nonaligned. This could greatly improve immersion and experience by letting players fully become a character, and address privacy concerns by disguising the voice to protect against harassment. With unaligned data, traditional methods, e.g., probabilistic models, become inaccurate, while recent methods such as deep neural networks (DNN) require too much preparation work. Common methods require multiple subjects to be trained in parallel, which constrains practicality in production environments. Our proposal trains a single subject, without parallel data, into a voice profile that can be used against any unknown source speaker. Prosodic data such as pitch, power and temporal structure are encoded into RGBA-colored frames, which are used in a multi-objective optimization problem to adjust interrelated features based on color likeness. Finally, frames are smoothed and adjusted before output. The method was evaluated using Mean Opinion Score, ABX, MUSHRA, Single Ease Questions and performance benchmarks on two voice profiles of different sizes, followed by a discussion of game implementation. Results show improved adaptation quality, especially with the larger voice profile, and the audience is positive about using such technology in future games.
Highlights
Voice adaptation (VA) is the speech processing technique [1,2,3,4,5] of translating a spoken message from a source speaker into the voice of a target speaker while retaining prosodic features
Scoring is done for each stimulus provided and is calculated as the arithmetic mean over N subjects. Fifty samples of varying length were presented
We presented a novel method to perform voice adaptation by encoding speech features into colored frames that are used in a multi-objective optimization problem to find an ideal target frame depending on the colors of a given input frame
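The frame encoding and color-based lookup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the channel assignment, the value ranges, the fourth `voicing` feature, and all function names are assumptions made for illustration, since the text only states that prosodic features are encoded into RGBA frames and matched by color likeness. Here each channel difference is treated as one objective, dominated candidates are discarded, and the remaining candidate with the smallest total difference is chosen.

```python
from typing import List, Tuple

Frame = Tuple[int, int, int, int]  # (R, G, B, A), each channel in 0..255


def scale(value: float, lo: float, hi: float) -> int:
    """Clamp value to [lo, hi] and map it linearly onto 0..255."""
    clamped = min(max(value, lo), hi)
    return round(255 * (clamped - lo) / (hi - lo))


def encode_frame(pitch_hz: float, power_db: float,
                 duration_ms: float, voicing: float) -> Frame:
    """Encode prosodic features as one RGBA frame.

    Assumed mapping and ranges (not given in the source):
    R = pitch, G = power, B = duration, A = hypothetical voicing measure.
    """
    return (
        scale(pitch_hz, 50.0, 500.0),
        scale(power_db, -60.0, 0.0),
        scale(duration_ms, 0.0, 100.0),
        scale(voicing, 0.0, 1.0),
    )


def objectives(source: Frame, candidate: Frame) -> Tuple[int, ...]:
    """One objective per channel: absolute color difference (lower is better)."""
    return tuple(abs(s - c) for s, c in zip(source, candidate))


def dominates(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """True if a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(source: Frame, profile: List[Frame]) -> List[Frame]:
    """Keep only the profile frames that no other frame dominates."""
    scored = [(objectives(source, frame), frame) for frame in profile]
    return [
        frame for obj, frame in scored
        if not any(dominates(other, obj) for other, _ in scored)
    ]


def best_target(source: Frame, profile: List[Frame]) -> Frame:
    """Pick the non-dominated frame with the smallest summed difference."""
    front = pareto_front(source, profile)
    return min(front, key=lambda frame: sum(objectives(source, frame)))
```

In this sketch, a voice profile is simply a list of encoded target frames, and `best_target` is called once per input frame. Summing the channel differences is only one way to choose among non-dominated candidates; a weighted sum or a perceptual color distance would fit the same structure.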
Summary
Voice adaptation (VA) is the speech processing technique [1,2,3,4,5] of translating a spoken message from a source speaker into the voice of a target speaker while retaining prosodic features. Prosodic information can be divided into many variables, such as the pitch of the voice, loudness, voice quality and more, which give our speech emotion and variance. This process allows a user to com-. Nonparallel training trains a single subject into a mappable set of data that can be looked up against an unrelated speaker despite varying corpora. This has seen some use in the past through the construction of pseudo data sets for pairs of source and target speakers, through the transformation of utterances using existing parallel data sets whose separate utterances are paired by estimation models, or through estimating the phonemic content for each active speaker. In this paper, related and competing methods are presented first. The proposed method and its supporting methods, including multi-objective optimization problems, are then detailed, followed by the evaluation and observations.
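As a small illustration of the scoring mentioned in the highlights, where each stimulus is scored as the arithmetic mean over N subjects, the following Python sketch computes a Mean Opinion Score per stimulus, i.e. MOS = (R_1 + ... + R_N) / N. The function name, the stimulus identifiers and the 1 to 5 rating scale are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean
from typing import Dict, List


def mean_opinion_scores(ratings: Dict[str, List[int]]) -> Dict[str, float]:
    """Map each stimulus id to the arithmetic mean of its subject ratings.

    `ratings` holds one list per stimulus, with one rating per subject
    (assumed here to be on a 1..5 scale).
    """
    return {stimulus: mean(scores) for stimulus, scores in ratings.items()}


# Hypothetical ratings from three subjects for two stimuli.
example = {
    "sample_01": [4, 5, 3],
    "sample_02": [2, 3, 4],
}
scores = mean_opinion_scores(example)
```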