Abstract

A method of detecting speech events under multiple-sound-source conditions using audio and video information is proposed. To detect speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of speech events can be obtained. The information on the detected speech events is then utilized in a robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise, and the coefficients of the beamformer are kept updated based on the speech-event information. The speech-event information is also used by the speech recognizer to extract the speech segment.
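
To make the fusion idea concrete, the following is a minimal illustrative sketch (in Python, not from the paper) of how audio and video evidence could be combined to decide whether a speech event is occurring at a candidate location. The function name, the likelihood-ratio inputs, the prior, and the naive-Bayes conditional-independence assumption are all hypothetical simplifications of the paper's Bayesian network.

```python
import numpy as np

def speech_event_posterior(lr_audio, lr_video, prior=0.1):
    """Posterior P(event | audio, video) from per-modality likelihood
    ratios P(observation | event) / P(observation | no event),
    assuming the two modalities are conditionally independent given
    the event (a naive-Bayes simplification)."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * lr_audio * lr_video
    return post_odds / (1.0 + post_odds)

# Example: three candidate locations; the likelihood ratios would
# come from sound localization and human tracking, respectively.
lr_audio = np.array([9.0, 0.5, 2.0])
lr_video = np.array([4.0, 3.0, 0.2])
print(speech_event_posterior(lr_audio, lr_video))
```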

Highlights

  • Detection of speech events is an important issue in automatic speech recognition (ASR) in real environments with background noise and interference

  • Detecting the presence or absence of the target speech signal is often important for noise-reduction methods such as adaptive beamforming or spectral subtraction, which can be used as preprocessors of ASR

  • In the maximum likelihood (ML) adaptive beamformer employed in this paper, the spatial correlation of the noise must be estimated while the target signal is absent, as described later in the paper (a minimal sketch of this update follows this list)
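
As a rough illustration of that last point, the sketch below estimates the noise spatial covariance from target-absent frames and derives beamformer weights of the form w = R⁻¹d / (dᴴR⁻¹d). The steering vector, diagonal loading, and synthetic data are assumptions for illustration; this is not the paper's implementation.

```python
import numpy as np

def noise_covariance(frames):
    """Spatial noise covariance for one frequency bin, estimated from
    multichannel STFT frames labeled as noise-only (target absent).
    frames: complex array of shape (num_frames, num_mics)."""
    return frames.T @ frames.conj() / frames.shape[0]

def ml_beamformer_weights(R_noise, d, loading=1e-6):
    """Weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.
    d is the steering vector toward the detected speech event;
    diagonal loading keeps the matrix inversion numerically stable."""
    R = R_noise + loading * np.eye(R_noise.shape[0])
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Usage: update R_noise only while the speech-event detector reports
# the target as absent, then re-derive the weights.
rng = np.random.default_rng(0)
noise_frames = rng.standard_normal((200, 4)) + 1j * rng.standard_normal((200, 4))
d = np.exp(-1j * np.pi * np.arange(4) * np.sin(np.deg2rad(30)))  # illustrative
w = ml_beamformer_weights(noise_covariance(noise_frames), d)
output = noise_frames @ w.conj()  # beamformer output per frame
```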

Summary

INTRODUCTION

Detection of speech events is an important issue in automatic speech recognition (ASR) in real environments with background noise and interference. Detecting the presence or absence of the target speech signal is often important for noise-reduction methods such as adaptive beamforming (see, e.g., [1]) or spectral subtraction (see, e.g., [2]), which can be used as preprocessors of ASR. When the environmental noise consists of nonspeech signals, a voice activity detector (VAD) can be used as a target speech detector (see, e.g., [3]). In environments such as offices and homes, however, both the target and the interference, coming from sources such as a TV or a radio, can be speech signals. A previous study focused mainly on detecting the attention of speakers to the terminal; in that system, it was assumed that only a single sound event occurs at a single moment.
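
For context, a minimal energy-based VAD of the kind alluded to above might look as follows; the frame length, hop size, and threshold are illustrative assumptions. Such a detector can separate speech from nonspeech noise, but it cannot distinguish target speech from speech interference, which is precisely the limitation the proposed audio-visual method addresses.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-30.0):
    """Flag each frame as speech (True) or nonspeech (False) by
    comparing its log energy to a fixed threshold relative to the
    peak frame energy of the utterance."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energies = np.array([np.mean(f ** 2) + 1e-12 for f in frames])
    log_e = 10.0 * np.log10(energies)
    return log_e > (log_e.max() + threshold_db)
```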

Sound localization
Human tracking by vision
Basic concept
Bayesian network used for information fusion
Feature vector
Inference of the Bayesian network
Learning of the Bayesian network
Overview of the system
ML beamformer
Speech recognition and model adaptation
Condition
Experiment 1
Experiment 2
Necessity of audio and video information fusion
Findings
Accuracy of estimation
CONCLUSION