SVCGAN: Speaker Voice Conversion Generative Adversarial Network for Children’s Speech Conversion and Recognition

Chenghuan Xie, Aimin Zhou

doi:10.52783/jes.1841

Abstract

Automatic speech recognition (ASR) is the process of converting spoken language into written text. However, the acoustic differences between children’s speech and adult speech are substantial, so ASR systems trained on adult speech recognize children’s speech poorly. To address this issue, we propose the speaker voice conversion generative adversarial network (SVCGAN). SVCGAN is a novel non-parallel voice conversion model that introduces three key enhancements: a log-cosh loss, a semantic-similarity loss, and a third adversarial loss. Together, these losses better preserve the semantic information in young children’s speech during voice conversion and improve the quality of the converted speech. Additionally, converting children’s speech into adult speech can reduce the character error rate (CER) of children’s speech recognition. Experimental results suggest that SVCGAN outperforms both CycleGAN-VC3 and MaskCycleGAN-VC in training efficiency, semantic information similarity, voice type similarity, and sound naturalness and intelligibility, which leads to a reduction in the CER of speech recognition for young children.
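For readers unfamiliar with two quantities named in the abstract, the following is a minimal sketch of their standard textbook definitions, not the authors' implementation. The function names `log_cosh_loss`, `edit_distance`, and `cer` are hypothetical, and the abstract does not say which features or which terms of the SVCGAN objective the log-cosh loss is applied to, so plain lists of floats are assumed here.

```python
import math


def log_cosh_loss(predicted, target):
    """Mean of log(cosh(p - t)) over paired values (generic definition)."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, t in zip(predicted, target):
        d = abs(p - t)
        # Stable identity: log(cosh(d)) = d + log(1 + exp(-2d)) - log(2)
        total += d + math.log1p(math.exp(-2.0 * d)) - math.log(2.0)
    return total / len(predicted)


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under these definitions a perfect transcript has a CER of 0, and the log-cosh loss behaves roughly like a squared error for small differences and like an absolute error for large ones, which is the usual motivation for choosing it.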

Full Text