Abstract

Automatic speech recognition (ASR) systems have made considerable progress in real applications thanks to careful architecture design combined with advanced techniques and algorithms. However, designing a system that efficiently integrates these techniques to reach state-of-the-art performance remains a challenging task. In this paper, we introduce an ASR system based on ensemble model combination and adaptation, with two characteristics: (1) large-scale combination of multiple ASR systems using Recognizer Output Voting Error Reduction (ROVER), and (2) multi-pass unsupervised speaker adaptation of deep neural network acoustic models together with topic adaptation of the language model. The multiple acoustic models were trained with different acoustic features and model architectures, which provided complementary and discriminative information for the ROVER process. With these multiple acoustic models, the ROVER process yielded a better estimate of word confidence, which helped in selecting data for unsupervised adaptation of the previously trained acoustic models. The final recognition result was obtained through multi-pass decoding, ROVER, and adaptation. We tested the system on Technology, Entertainment and Design (TED) lecture speeches used in the International Workshop on Spoken Language Translation (IWSLT) evaluation campaign, and obtained word error rates of 6.5%, 7.0%, 10.6%, and 8.4% on the 2011, 2012, 2013, and 2014 test sets, respectively, which to our knowledge are the best results for these evaluation sets.
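To make the combination and data-selection steps concrete, the following is a minimal sketch of ROVER-style word voting followed by confidence-based utterance selection. It is an illustration under stated assumptions, not the system described in this paper: the hypotheses are assumed to be already aligned into slots of equal length, the score mixes vote frequency and average confidence with a weight `alpha`, and the names `rover_vote`, `select_for_adaptation`, `alpha`, and `threshold` are hypothetical.

```python
from collections import defaultdict

NULL = ""  # marks "no word" in an aligned slot (assumption of this sketch)


def rover_vote(aligned, alpha=0.5):
    """Pick one word per slot from several pre-aligned hypotheses.

    aligned: one list per system, each a list of (word, confidence) pairs,
             with the same number of slots for every system.
    alpha:   weight of vote frequency versus average confidence.
    Returns a list of (word, score) pairs, one per slot.
    """
    n_systems = len(aligned)
    n_slots = len(aligned[0])
    if any(len(hyp) != n_slots for hyp in aligned):
        raise ValueError("all hypotheses must have the same number of slots")

    result = []
    for slot in zip(*aligned):
        votes = defaultdict(int)
        conf_sum = defaultdict(float)
        for word, conf in slot:
            votes[word] += 1
            conf_sum[word] += conf
        # Score = alpha * vote fraction + (1 - alpha) * mean confidence.
        scores = {
            w: alpha * votes[w] / n_systems
            + (1 - alpha) * conf_sum[w] / votes[w]
            for w in votes
        }
        best = max(scores, key=scores.get)
        result.append((best, scores[best]))
    return result


def select_for_adaptation(voted_by_utt, threshold=0.6):
    """Keep utterance ids whose mean score over non-null words
    reaches the threshold (a hypothetical selection rule)."""
    selected = []
    for utt_id, voted in voted_by_utt.items():
        scores = [s for w, s in voted if w != NULL]
        if scores and sum(scores) / len(scores) >= threshold:
            selected.append(utt_id)
    return selected
```

In this sketch, a word that most systems agree on with high confidence receives a high score, so utterances dominated by such words are the ones kept as adaptation data. A full ROVER implementation also performs the alignment itself, which is omitted here.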
