Abstract

We propose two approaches to handling speech recognition for task domains with sparse matched training data. The first is an active learning method that selects training data for the target domain from a general domain that already has a large amount of labeled speech data, using latent variables with disentangled attributes. For the active learning process, we designed an integrated system consisting of a variational autoencoder, whose encoder infers latent variables with disentangled attributes from the input speech, and a classifier that selects training data whose attributes match the target domain. The second approach combines data augmentation, which generates matched target-domain speech data, with transfer learning based on teacher/student learning. To evaluate the proposed methods, we experimented with several task domains that have sparse matched training data. The experimental results show that the proposed method exhibits qualitative characteristics suited to the intended purpose, outperforms random selection, and is comparable to using an equal amount of additional target-domain data.
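To make the selection step of the first approach concrete, the following is a minimal sketch (in PyTorch) of how an attribute-disentangling VAE encoder and a domain-match classifier could be combined to pick general-domain utterances for a target domain. It is not the authors' implementation: the class names, layer sizes, mean-pooling of frame features, and the block-wise split of the latent vector into attributes are all illustrative assumptions.

# Sketch: a VAE-style encoder maps pooled speech features to a latent vector
# whose blocks are treated as disentangled attributes, and a small classifier
# scores how well those attributes match the target domain. Hypothetical
# dimensions and pooling; stands in for the paper's integrated system.
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """VAE encoder: infers a latent vector split into attribute blocks."""
    def __init__(self, feat_dim=80, latent_dim=32, n_attributes=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mu = nn.Linear(256, latent_dim * n_attributes)
        self.logvar = nn.Linear(256, latent_dim * n_attributes)
        self.n_attributes = n_attributes
        self.latent_dim = latent_dim

    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        pooled = feats.mean(dim=1)                  # utterance-level pooling
        h = self.backbone(pooled)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        # Reshape so each attribute occupies its own latent block.
        return z.view(-1, self.n_attributes, self.latent_dim)

class DomainMatchClassifier(nn.Module):
    """Scores whether the inferred attributes match the target domain."""
    def __init__(self, latent_dim=32, n_attributes=4):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(latent_dim * n_attributes, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z_attr):                      # z_attr: (batch, n_attr, latent_dim)
        return torch.sigmoid(self.scorer(z_attr.flatten(1))).squeeze(-1)

def select_matching_utterances(encoder, classifier, pool_feats, top_k):
    """Pick the top_k general-domain utterances whose attributes best match
    the target domain, to be added to the sparse in-domain training set."""
    with torch.no_grad():
        scores = classifier(encoder(pool_feats))
    return torch.topk(scores, top_k).indices

# Example: score a pool of 1000 general-domain utterances (random features
# stand in for log-Mel filterbanks) and keep the 100 best-matching ones.
encoder, classifier = AttributeEncoder(), DomainMatchClassifier()
pool = torch.randn(1000, 200, 80)                   # (utterances, frames, mel bins)
selected = select_matching_utterances(encoder, classifier, pool, top_k=100)
print(selected.shape)                               # torch.Size([100])

In practice, the encoder would be trained as part of the VAE on the general-domain speech and the classifier on a small amount of labeled target-domain data, after which the selected utterances augment the sparse target-domain training set.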

Highlights

  • Deep neural networks (DNN) have been widely adopted and applied to traditional pattern recognition applications, such as speech and image recognition

  • We evaluated the performance of the proposed active learning method using latent variables with disentangled attributes

  • Domain adaptation approaches require a considerable amount of domain speech data with transcription


Summary

Introduction

Deep neural networks (DNNs) have been widely adopted in traditional pattern recognition applications such as speech and image recognition, and DNN-based acoustic models have significantly improved speech recognition performance [1,2]. Early studies focused on acoustic models based on the deep neural network-hidden Markov model (DNN-HMM) hybrid, whereas end-to-end speech recognition, which replaces the HMM entirely with a DNN, has since become the focus and has been adopted in many commercialized speech recognition systems [3,4]. DNN-based acoustic models, and end-to-end models in particular, use more parameters than conventional HMM-based models and therefore require massive amounts of training data to achieve high performance.

