Abstract

There has been long-term interest in using visual information to improve the performance of automatic speech recognition (ASR) systems. Conventional audio-visual speech recognition (AVSR) systems require both audio and visual information, which limits their wider application when the visual modality is not available. One possible solution is to use acoustic-to-visual (A2V) inversion techniques to generate visual features from audio alone. Previous research in this direction trained inversion models on synthetic acoustic-articulatory parallel data and did not account for the acoustic mismatch between the audio-visual (AV) parallel data and the target data. In addition, this line of work has focused largely on English. In this article, a real 3D Audio-Visual Mandarin Continuous Speech (3DAV-MCS) corpus was used to train deep neural network based A2V inversion models. Cross-domain adaptation of the inversion models allows suitable visual features to be generated from acoustic data of mismatched domains. The proposed cross-domain deep visual feature generation techniques were evaluated on two state-of-the-art Mandarin speech recognition tasks: DARPA GALE broadcast transcription and BOLT conversational telephone speech recognition. The AVSR systems constructed using the cross-domain generated visual features consistently outperformed the baseline convolutional neural network (CNN) ASR systems by up to 3.3% absolute (9.1% relative) character error rate (CER) reduction after both speaker adaptive training and sequence discriminative training were performed.
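At its core, an A2V inversion model is a regression network trained on parallel acoustic and visual frames, then applied to audio-only data from the target domain. The sketch below is a minimal illustration of that idea, not the paper's actual architecture: the layer sizes, feature dimensions, splicing window, and PyTorch implementation are all assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class A2VInversionNet(nn.Module):
    """Feed-forward DNN mapping context-spliced acoustic frames to visual
    feature vectors (e.g. 3D lip/face shape parameters). Dimensions are
    illustrative assumptions, not those used in the paper."""
    def __init__(self, acoustic_dim=40, context=5, visual_dim=30, hidden=1024):
        super().__init__()
        in_dim = acoustic_dim * (2 * context + 1)   # spliced input window
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, visual_dim),          # regress visual features
        )

    def forward(self, x):
        return self.net(x)

# Train on the audio-visual parallel corpus as a frame-level MSE regression.
model = A2VInversionNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

acoustic = torch.randn(256, 40 * 11)   # placeholder spliced acoustic frames
visual = torch.randn(256, 30)          # placeholder parallel visual targets
opt.zero_grad()
loss = loss_fn(model(acoustic), visual)
loss.backward()
opt.step()

# After cross-domain adaptation of the inversion model (the paper's focus;
# the specific adaptation scheme is not reproduced here), visual features
# are generated from audio-only target-domain frames and appended to the
# acoustic features of the downstream AVSR system.
with torch.no_grad():
    generated_visual = model(torch.randn(100, 40 * 11))
```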
