Abstract
This paper compares unsupervised sequence training techniques for deep neural networks (DNNs) for broadcast transcription. Recent progress in the digital archiving of broadcast content has made it easier to access large amounts of speech data. Such archived data are helpful for acoustic/language modeling in live-broadcast captioning based on automatic speech recognition (ASR). In Japanese broadcasts, however, archived programs, e.g., sports news, do not always have closed captions, which typically serve as reference transcripts. Unsupervised adaptation techniques are therefore needed for performance improvements, even when a DNN is used as the acoustic model. In this paper, we compare three unsupervised sequence adaptation techniques: maximum a posteriori (MAP), entropy minimization, and Bayes risk minimization. Experimental results on transcribing sports news programs showed that the best ASR performance is obtained with Bayes risk minimization, which reflects information about expected errors, while comparable results are obtained with MAP, the simplest of the three unsupervised sequence adaptation techniques.
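To make the entropy-minimization criterion concrete, the following is a minimal sketch, not the paper's implementation: it illustrates one adaptation step that sharpens a DNN acoustic model's posteriors on unlabeled audio by minimizing the entropy of its outputs. For brevity it operates at the frame level rather than over sequence-level lattices as in the paper, and it assumes a PyTorch setting; all names (`model`, `feats`, `optimizer`) are illustrative assumptions.

```python
# Hypothetical sketch of unsupervised adaptation by entropy minimization.
# Assumes `model` maps acoustic feature frames to senone logits; this is a
# frame-level simplification of the sequence-level criterion in the paper.
import torch
import torch.nn.functional as F

def entropy_minimization_step(model, feats, optimizer):
    """One adaptation step on a batch of unlabeled frames `feats`."""
    log_probs = F.log_softmax(model(feats), dim=-1)  # (frames, senones)
    probs = log_probs.exp()
    # Frame-averaged Shannon entropy: H(p) = -sum_k p_k log p_k.
    # Minimizing it pushes the model toward confident (low-entropy)
    # posteriors on the unlabeled adaptation data.
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()
```

By contrast, MAP-style unsupervised adaptation would train on the decoder's one-best hypotheses as if they were labels, and Bayes risk minimization would additionally weight competing hypotheses by their expected error, which is why it can outperform the other two criteria.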