Abstract

Multi-task learning (MTL) involves training two or more tasks over a shared representation. The present work applies MTL to audio-visual automatic speech recognition (AVASR). The primary task of the MTL system learns a mapping between audio-visual fused features and frame labels obtained from an acoustic GMM/HMM model. An auxiliary task, which maps visual features to frame labels obtained from a visual GMM/HMM model, is trained jointly with the primary task. Results of a baseline hybrid DNN-HMM AVASR model are compared with the MTL model at various levels of babble noise. The results indicate that MTL is useful at higher noise levels: compared with the baseline model, an approximate 7% relative improvement in WER is reported at −3 dB SNR.
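The joint objective described above can be sketched as a weighted sum of a primary cross-entropy loss (acoustic frame labels from audio-visual features) and an auxiliary cross-entropy loss (visual frame labels from visual features). The shapes, label counts, interpolation weight, and single shared layer below are all assumptions for illustration, not the paper's exact hybrid DNN-HMM architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of frame-level logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch of frames (hypothetical dimensions).
av_feats = rng.normal(size=(8, 120))            # audio-visual fused features
v_feats = rng.normal(size=(8, 40))              # visual-only features
acoustic_labels = rng.integers(0, 500, size=8)  # from acoustic GMM/HMM alignment
visual_labels = rng.integers(0, 500, size=8)    # from visual GMM/HMM alignment

# One shared hidden layer feeding two task-specific output layers.
W_shared = rng.normal(size=(160, 256)) * 0.01
W_primary = rng.normal(size=(256, 500)) * 0.01
W_auxiliary = rng.normal(size=(256, 500)) * 0.01

hidden = np.maximum(np.concatenate([av_feats, v_feats], axis=1) @ W_shared, 0)
loss_primary = cross_entropy(hidden @ W_primary, acoustic_labels)
loss_auxiliary = cross_entropy(hidden @ W_auxiliary, visual_labels)

alpha = 0.8  # assumed interpolation weight between the two tasks
mtl_loss = alpha * loss_primary + (1 - alpha) * loss_auxiliary
print(float(mtl_loss))
```

Minimizing `mtl_loss` updates the shared layer with gradients from both tasks; the auxiliary visual task acts as a regularizer, which is consistent with the observed gains at low SNR where the acoustic signal is least reliable.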
