Abstract

A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist in the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-frame-rate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded with both a microphone array and a microphone built into a mobile computer. For applications related to the training of AVSR systems, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Owing to the inclusion of recordings made in noisy conditions, the corpus can also be used to test the robustness of speech recognition systems in the presence of acoustic background noise. The paper describes the process of building the corpus, including the recording, labeling and post-processing phases. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested on the material contained in the corpus, are presented and discussed together with comparative results obtained with a state-of-the-art commercial ASR engine. To demonstrate its practical use, the corpus has been made publicly available.
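As noted above, every utterance in the corpus comes with a manually created label file. The sketch below shows how such a file could be read, assuming an HTK-style plain-text layout (start time, end time and label per line, with times in 100 ns units); this format and the example path are illustrative assumptions, since the abstract does not specify the exact layout of the label files.

```python
from pathlib import Path

def read_label_file(path):
    """Parse an utterance label file into (start_s, end_s, label) tuples.

    Assumes an HTK-style .lab layout: "<start> <end> <label>" per line,
    times in 100 ns units. This is an assumption for illustration; consult
    the corpus documentation for the actual file format.
    """
    segments = []
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) >= 3:
            start, end = int(parts[0]), int(parts[1])
            label = " ".join(parts[2:])
            segments.append((start * 1e-7, end * 1e-7, label))  # convert to seconds
    return segments

# e.g. read_label_file("session01/utterance0001.lab")  # hypothetical path
```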

Highlights

  • Current advances in microelectronics make efficient processing of audio and video data in computerized mobile devices possible

  • The authors of this study evaluate the performance of a system based on acoustic and visual features combined with Dynamic Bayesian Network (DBN) models (a feature-fusion sketch follows this list)

  • The self-developed automatic speech recognition (ASR) engine was implemented using the Hidden Markov Model Toolkit (HTK), based on Hidden Markov Models (HMMs)
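The highlights above mention combining acoustic and visual features before modeling. The following is a minimal sketch of early (feature-level) fusion, assuming MFCC audio features and a small set of per-frame lip-geometry parameters; the librosa-based extraction, the feature dimensions and the synthetic inputs are illustrative assumptions, not the feature set actually used in the paper.

```python
import numpy as np
import librosa

def fuse_audio_visual(y, sr, visual_feats, n_mfcc=13, hop_s=0.01):
    """Early (feature-level) fusion: concatenate per-frame MFCCs with visual features.

    y, sr        -- audio waveform and its sample rate
    visual_feats -- (n_video_frames, d_v) array, e.g. lip-geometry parameters
                    per camera frame (hypothetical shape and content)
    """
    hop = int(hop_s * sr)
    # MFCCs at a 10 ms hop; transpose to (n_audio_frames, n_mfcc)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop).T

    # Up-sample the slower visual stream to the audio frame rate by
    # nearest-neighbour repetition so the two streams are frame-aligned.
    idx = np.linspace(0, len(visual_feats) - 1, num=len(mfcc)).round().astype(int)
    visual_aligned = visual_feats[idx]

    return np.hstack([mfcc, visual_aligned])  # (n_audio_frames, n_mfcc + d_v)

# Synthetic example: 1 s of audio at 16 kHz and 100 video frames of 6 lip parameters.
fused = fuse_audio_visual(np.random.randn(16000).astype(np.float32), 16000,
                          np.random.randn(100, 6))
print(fused.shape)  # roughly (number of 10 ms audio frames, 19)
```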


Introduction

Current advances in microelectronics make efficient processing of audio and video data in computerized mobile devices possible. Most smartphones and tablet computers are equipped with audio-based speech recognition systems. When those functionalities are used in real environments, the speech signal can become corrupted, negatively influencing speech recognition accuracy (Trentin and Matassoni 2003). Inspired by the human-like multimodal perception of speech described in the literature (e.g. by McGurk 1976), additional information from the visual modality, usually extracted from a recording of the speaker's lips, can be introduced in order to complement the acoustic information and to mitigate the negative impact of audio corruption. The most recent works employ Deep Neural Networks (DNN) (Almajai et al. 2016; Mroueh et al. 2015) and Convolutional Neural Networks (CNN) (Noda et al. 2015) serving as front-ends for audio and visual feature extraction. In a novel approach to visual speech recognition, Chung et al. (2016) employed Convolutional Neural Networks and performed processing at the sentence level, rather than at the phoneme level, in both the learning and analysis phases.
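To make the CNN front-end idea concrete, the following is a minimal PyTorch sketch of a convolutional network that maps a grayscale lip-region crop to a fixed-length visual feature vector. The architecture, the 64x64 input size and the feature dimension are illustrative assumptions and do not correspond to any model from the cited works.

```python
import torch
import torch.nn as nn

class LipFrontEnd(nn.Module):
    """Minimal CNN front-end: one grayscale lip-region frame -> feature vector."""

    def __init__(self, feature_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Assumes 64x64 input frames, downsampled twice to 16x16.
        self.fc = nn.Linear(32 * 16 * 16, feature_dim)

    def forward(self, x):            # x: (batch, 1, 64, 64)
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))

frames = torch.randn(8, 1, 64, 64)   # a batch of 8 hypothetical lip-region crops
features = LipFrontEnd()(frames)     # -> (8, 64) visual feature vectors
```

In an AVSR pipeline these per-frame visual features would then be combined with the acoustic features, for example by the kind of feature-level fusion sketched earlier, before being passed to the recognition back-end.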

