Abstract

Voice Conversion (VC) is a technique to convert speech data of source speaker into ones of target speaker. VC has been investigated and statistical VC is used for various purposes. Conventional VC uses acoustic features, however, the audio-only VC has suffered from the degradation in noisy or real environments. This paper proposes an AudioVisual VC (AVVC) method using not only audio features but also visual information, i.e. lip images. Eigenlip feature is employed in our scheme as visual feature. We also propose a feature selection approach for audio-visual features. Experiments were conducted to evaluate our AVVC scheme comparing with audio-only VC, using noisy data. The results show that AVVC can improve the performance even in noisy environments, by properly selecting audio and visual parameters. It is also found that visual VC is also successful. Furthermore, it is observed that visual dynamic features are more effective than visual static information.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.