Abstract

It has been shown that combining acoustic and articulatory information yields significant performance improvements in automatic speech recognition (ASR). In practice, however, articulatory information is not available during recognition, and the usual approach is to estimate it from the acoustic signal. In this paper, we propose a different approach based on the generalized distillation framework, in which acoustic-to-articulatory inversion is not necessary. We train two DNN models: a “teacher” that learns from both acoustic and articulatory features, and a “student” that is trained on acoustic features only. The student’s training is guided by the teacher, allowing it to reach a level of performance that regular training cannot achieve, even though no articulatory features are available at test time. The paper is organized as follows: Section 1 gives the introduction and briefly discusses related work. Section 2 describes the distillation training process, and Section 3 describes the ASR system used in this paper. Section 4 presents the experiments, and Section 5 concludes the paper.
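To make the teacher-guided training concrete, the following is a minimal sketch of a generalized distillation loss in PyTorch, in the spirit of the framework described above. It assumes the teacher's logits were produced by a model trained on acoustic plus articulatory features, while the student sees acoustic frames only; the function name and the `temperature` and `imitation_weight` hyperparameters are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits, hard_labels,
                                  temperature=2.0, imitation_weight=0.5):
    """Blend cross-entropy on ground-truth labels with a KL term that pulls
    the student's softened outputs toward the teacher's soft targets.
    Hyperparameter values here are illustrative, not from the paper."""
    # Standard cross-entropy against the ground-truth frame labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # Soften both output distributions with the temperature before comparing.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence from teacher to student soft targets; the T^2 factor
    # keeps gradient magnitudes comparable across temperature settings.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # imitation_weight trades off imitating the teacher vs. fitting labels.
    return (1.0 - imitation_weight) * hard_loss + imitation_weight * soft_loss
```

At test time only the student network is used, so the articulatory stream is needed solely while computing `teacher_logits` during training.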
