Voice adaptation by color-encoded frame matching as a multi-objective optimization problem for future games

Mads Midtlyng, Hiroshi Hosobe, Yuji Sato

doi:10.1007/s40747-021-00604-6

Abstract

Voice adaptation is an interactive speech processing technique that allows the speaker to transmit with a chosen target voice. We propose a novel method intended for dynamic scenarios, such as online video games, where the source speaker’s and target speaker’s data are nonaligned. This could greatly improve immersion by letting players fully become a character, and could address privacy concerns by disguising the voice to protect against harassment. With unaligned data, traditional methods such as probabilistic models become inaccurate, while recent methods such as deep neural networks (DNN) require substantial preparation work. Common methods require multiple subjects to be trained in parallel, which constrains their practicality in production environments. Our proposal trains a single subject, without parallel data, into a voice profile that can be used against any unknown source speaker. Prosodic data such as pitch, power and temporal structure are encoded into RGBA-colored frames, which are used in a multi-objective optimization problem to adjust interrelated features based on color likeness. Finally, frames are smoothed and adjusted before output. The method was evaluated using Mean Opinion Score, ABX, MUSHRA, Single Ease Questions and performance benchmarks with two voice profiles of varying sizes, followed by a discussion of game implementation. Results show improved adaptation quality, especially with a larger voice profile, and the audience is positive about using such technology in future games.

Highlights

Voice adaptation (VA) is the speech processing technique [1,2,3,4,5] of translating a spoken message from a source speaker into the voice of a target speaker while retaining prosodic features.
The scoring is done for each stimulus provided and is calculated as the arithmetic mean over N subjects. 50 samples of varying length were presented.
We presented a novel method to perform voice adaptation by encoding speech features into colored frames that are used in a multi-objective optimization problem to find an ideal target frame depending on the colors of a given input frame.
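
The frame-matching idea in the last highlight can be illustrated with a minimal Python sketch. It is not the authors' implementation: the channel assignment (pitch to R, power to G, duration to B, voicing to A), the value ranges, and the names `encode_frame`, `pareto_front` and `match_frame` are all assumptions made for illustration. Each color channel's absolute difference is treated as one objective, non-dominated profile frames are kept, and the final pick among them uses a plain sum as a tie-break.

```python
from typing import List, Tuple

Frame = Tuple[int, int, int, int]  # (R, G, B, A), each channel 0..255


def _scale(value: float, lo: float, hi: float) -> int:
    """Clamp value to [lo, hi] and map it linearly onto 0..255."""
    clamped = min(max(value, lo), hi)
    return round(255 * (clamped - lo) / (hi - lo))


def encode_frame(pitch_hz: float, power_db: float,
                 duration_ms: float, voiced: bool) -> Frame:
    """Encode prosodic features as one RGBA frame.

    The channel mapping and ranges are hypothetical, not from the paper.
    """
    return (
        _scale(pitch_hz, 50.0, 500.0),    # R: pitch
        _scale(power_db, -60.0, 0.0),     # G: power
        _scale(duration_ms, 0.0, 100.0),  # B: temporal structure
        255 if voiced else 0,             # A: voicing flag
    )


def objectives(source: Frame, candidate: Frame) -> Tuple[int, ...]:
    """One objective per channel: absolute color difference (lower is better)."""
    return tuple(abs(s - c) for s, c in zip(source, candidate))


def dominates(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(source: Frame, profile: List[Frame]) -> List[Frame]:
    """Keep the profile frames whose objective vectors are not dominated."""
    scored = [(frame, objectives(source, frame)) for frame in profile]
    return [frame for frame, obj in scored
            if not any(dominates(other, obj) for _, other in scored)]


def match_frame(source: Frame, profile: List[Frame]) -> Frame:
    """Pick one target frame from the Pareto front (sum of differences as tie-break)."""
    front = pareto_front(source, profile)
    return min(front, key=lambda frame: sum(objectives(source, frame)))
```

In this sketch, a source frame encoded from an unknown speaker is looked up against a list of frames from the trained voice profile with `match_frame(source, profile)`. Other ways of selecting from the Pareto front are equally plausible; the sum is used here only to keep the example short.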

Summary

Introduction

Voice adaptation (VA) is the speech processing technique [1,2,3,4,5] of translating a spoken message from a source speaker into the voice of a target speaker while retaining prosodic features. Prosodic information can be divided into many variables, such as the pitch of the voice, loudness, voice quality and more, which give our speech emotion and variance. This process allows a user to com-. Nonparallel VA trains a single subject into a mappable set of data that can be looked up against an unrelated speaker despite varying corpora. This has seen some use in the past through the construction of pseudo data sets for pairs of source and target speakers, the transformation of utterances using existing parallel data sets whose separate utterances are paired by estimation models, or the estimation of phonemic content per active speaker. In this paper, related and contending methods are presented first; then the proposed method and its supporting methods, including multi-objective optimization problems, are detailed; lastly, evaluation and observations are given.

Methods

Results

Discussion

Conclusion

Journal: Complex & Intelligent Systems
Publication Date: Jan 4, 2022
License type: open-access

Similar Papers

Spectral conversion using deep neural networks trained with multi-source speakers
Li-Juan Liu ... Zhen-Hua Ling
01 Apr 2015

Voice Conversion Through Residual Warping in a Sparse, Anchor-Based Representation of Speech
Christopher Liberatore ... Ricardo Gutierrez-Osuna
01 Apr 2018

A Statistical Prosodic Model for Voice Conversion
Jan Schwarz ... Ulrich Heute
The Journal of the Acoustical Society of America | VOL. 123
01 May 2008

Lightweight Multi-objective Voice Adaptation for Real-time Speech Interaction Applied in Games
Mads Midtlyng ... Yuji Sato
01 Aug 2020
