Abstract

Considerable interest has arisen in recent years in the adaptation of deep neural network (DNN) acoustic models, as these have become the state of the art in automatic speech recognition. This work focuses on approaches that allow rapid and robust adaptation of such models. First, i-vectors are added to the DNN input as speaker-informed features, and an informative prior is introduced into i-vector estimation to improve robustness to limited adaptation data. I-vectors are then combined with a structured adaptive DNN, the multibasis adaptive neural network (MBANN), and the complementarity of these adaptation techniques is investigated. Moreover, i-vectors are used to predict the MBANN transforms, avoiding the initial decoding pass and alignment. These approaches are evaluated on a U.S. English Broadcast News (BN) transcription task with two distinct test sets. The first, drawn from the BN task and BN-style YouTube videos, is acoustically matched to the training data, while the second comprises acoustically mismatched YouTube videos of diverse content. The performance gains from these schemes are found to be sensitive to the level of mismatch between training and test data. The MBANN system combined with i-vector input achieves the best performance on the BN test sets, while the i-vector-based predictive MBANN scheme proves more robust to acoustically mismatched conditions and outperforms the other adaptation schemes in such scenarios.
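The first adaptation scheme above appends a speaker-level i-vector to every acoustic frame fed to the DNN. A minimal sketch of that input augmentation, with hypothetical dimensions (40-dim frames, 100-dim i-vectors) chosen purely for illustration:

```python
import numpy as np

# Hypothetical dimensions for illustration: 40-dim acoustic frames,
# 100-dim speaker i-vector (not taken from the paper).
num_frames, feat_dim, ivec_dim = 300, 40, 100

frames = np.random.randn(num_frames, feat_dim)   # per-frame acoustic features (T x 40)
ivector = np.random.randn(ivec_dim)              # one i-vector per speaker/utterance

# Tile the utterance-level i-vector across all frames and append it to
# each frame, giving the DNN speaker-informed input of size 40 + 100.
dnn_input = np.concatenate(
    [frames, np.tile(ivector, (num_frames, 1))], axis=1
)
print(dnn_input.shape)  # (300, 140)
```

Because the same i-vector is repeated on every frame, the network can condition its frame-level predictions on speaker identity without any change to its training procedure.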
