Abstract

There has been growing interest in introducing speech as a new modality into the human-computer interface (HCI). Because speech is multimodal in nature, its visual component is considered to yield information that is not always present in the acoustic signal and to enable improved system performance over acoustic-only methods, especially in noisy environments. In this paper, we investigate the usefulness of visual speech information in HCI-related applications. We first introduce a new algorithm that automatically locates the mouth region using color and motion information and segments the lip region using both color and edge information within a Markov random field framework. We then derive a relevant set of visual speech parameters and incorporate them into a recognition engine. We compare the performance of various visual features, including the inner lip contour and the visibility of the tongue and teeth, to explore their impact on recognition accuracy. Using a common visual feature set, we demonstrate two applications that exploit speechreading in a joint audio-visual speech signal processing task: speech recognition and speaker verification. Experimental results on two databases demonstrate that the visual information is highly effective for improving recognition performance over a variety of acoustic noise levels.
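As an illustration of mouth localization based on combined color and motion cues, the following Python sketch uses frame differencing for motion and the red-chrominance (Cr) channel for lip color. The thresholds, function name, and use of OpenCV are assumptions made for illustration only; they do not reproduce the paper's algorithm or its MRF-based lip segmentation.

import cv2
import numpy as np

def locate_mouth_region(prev_frame, frame):
    """Illustrative mouth localization: combine a motion mask (frame
    differencing) with a lip-color mask (red chrominance) and return the
    bounding box of the largest combined response.  Thresholds are
    arbitrary placeholders, not values from the paper."""
    # Motion cue: absolute difference between consecutive grey-level frames
    grey_prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    grey_curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    motion = cv2.absdiff(grey_curr, grey_prev)
    _, motion_mask = cv2.threshold(motion, 15, 255, cv2.THRESH_BINARY)

    # Color cue: lips are redder than surrounding skin (Cr channel in YCrCb)
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1]
    _, color_mask = cv2.threshold(cr, 150, 255, cv2.THRESH_BINARY)

    # Combine cues and keep the largest connected component
    combined = cv2.bitwise_and(motion_mask, color_mask)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(combined)
    if num < 2:
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y, w, h = stats[largest, :4]
    return x, y, w, h  # candidate mouth bounding box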

Highlights

  • In recent years there has been growing interest in introducing new modalities into human-computer interfaces (HCIs)

  • During the following years various automatic speechreading systems were developed [8, 9] that demonstrated that visual speech yields information that is not always present in the acoustic signal and enables improved recognition accuracy over audio-only automatic speech recognition (ASR) systems, especially in environments corrupted by acoustic noise and multiple talkers

  • The verification performance is characterized by two error rates computed during the tests: the false acceptance rate (FAR) and the false rejection rate (FRR)
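For illustration, a minimal sketch of how these two error rates are conventionally computed from verification trial outcomes is given below; the variable names and counts are hypothetical and the snippet is not taken from the paper.

def far_frr(impostor_accepted, impostor_trials, genuine_rejected, genuine_trials):
    """False acceptance rate: fraction of impostor attempts wrongly accepted.
    False rejection rate: fraction of genuine (client) attempts wrongly rejected."""
    far = impostor_accepted / impostor_trials
    frr = genuine_rejected / genuine_trials
    return far, frr

# Toy example: 1000 impostor and 1000 client trials (hypothetical counts)
far, frr = far_frr(impostor_accepted=30, impostor_trials=1000,
                   genuine_rejected=50, genuine_trials=1000)
print(f"FAR = {far:.1%}, FRR = {frr:.1%}")  # FAR = 3.0%, FRR = 5.0%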


Summary

INTRODUCTION

In recent years there has been growing interest in introducing new modalities into human-computer interfaces (HCIs). With this motivation, much research has been carried out in automatic speech recognition (ASR). Acoustic-only ASR degrades in noisy environments, however, and multiple speakers are very hard to separate acoustically [4]. To overcome this limitation, automatic speechreading systems, which use visual information to augment acoustic information, have been considered. During the following years various automatic speechreading systems were developed [8, 9] that demonstrated that visual speech yields information that is not always present in the acoustic signal and enables improved recognition accuracy over audio-only ASR systems, especially in environments corrupted by acoustic noise and multiple talkers.
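To make the idea of augmenting acoustic information with visual information concrete, the sketch below shows a generic decision-level (late) integration scheme in which per-class log-likelihoods from separate audio and visual classifiers are linearly combined. This is a common approach in the audio-visual literature, not necessarily the integration method used in this paper, and the weight values are arbitrary assumptions.

import numpy as np

def late_fusion_scores(audio_loglik, visual_loglik, audio_weight=0.7):
    """Illustrative late integration: weighted sum of per-class log-likelihoods
    from separate audio and visual classifiers.  The audio weight would
    typically be lowered as the acoustic SNR drops; 0.7 is a placeholder,
    not a value from the paper."""
    w = audio_weight
    return w * np.asarray(audio_loglik) + (1.0 - w) * np.asarray(visual_loglik)

# Toy usage: three candidate words scored by each modality (made-up numbers)
audio = np.log([0.5, 0.3, 0.2])    # acoustic scores are ambiguous in noise
visual = np.log([0.1, 0.8, 0.1])   # lip shapes clearly favour word 1
print(np.argmax(late_fusion_scores(audio, visual, audio_weight=0.4)))  # -> 1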

PREVIOUS WORK ON VISUAL FEATURE EXTRACTION
Color analysis
Lip region detection
MRF-based lip segmentation
Visual speech features
Visual speech recognition
Audio-visual integration
SPEAKER VERIFICATION
Findings
SUMMARY
