Keyword recognition is the basis of speech recognition, and its application is rapidly increasing in keyword spotting, robotics, and smart home surveillance. Because of these advanced applications, improving the accuracy of keyword recognition is crucial. In this paper, we proposed voice conversion (VC) - based augmentation to increase the limited training dataset and a fusion of a convolutional neural network (CNN) and long-short term memory (LSTM) model for robust speaker-independent isolated keyword recognition. Collecting and preparing a sufficient amount of voice data for speaker-independent speech recognition is a tedious and bulky task. To overcome this, we generated new raw voices from the original voices using an auxiliary classifier conditional variational autoencoder (ACVAE) method. In this study, the main intention of voice conversion is to obtain numerous and various human-like keywords’ voices that are not identical to the source and target speakers’ pronunciation. Parallel VC was used to accurately maintain the linguistic content. We examined the performance of the proposed voice conversion augmentation techniques using robust deep neural network algorithms. Original training data, excluding generated voice using other data augmentation and regularization techniques, were considered as the baseline. The results showed that incorporating voice conversion augmentation into the baseline augmentation techniques and applying the CNN-LSTM model improved the accuracy of isolated keyword recognition.
Read full abstract