Abstract

We propose a robust distant speech recognition method that combines multiple microphone-array processing with position-dependent cepstral mean normalization (CMN). In the recognition stage, the system estimates the speaker position and selects the compensation parameters, estimated a priori, that correspond to that position. The system then applies CMN to the speech (i.e., position-dependent CMN) and performs speech recognition for each channel. The features obtained from the multiple channels are integrated in one of two ways. The first, called multiple-decoder processing, takes the maximum vote or the maximum summation likelihood of the recognition results from the multiple channels as the final result. The second, called single-decoder processing, calculates the output probability of each input at the frame level and passes these output probabilities to a single decoder, which lowers the computational cost. We combine delay-and-sum beamforming with multiple-decoder or single-decoder processing and call the combination multiple microphone-array processing. We evaluated the proposed method on limited-vocabulary (100 words) distant isolated word recognition in a real environment. The proposed multiple microphone-array processing using multiple decoders with position-dependent CMN achieved a 3.2% improvement (50% relative error reduction rate) over delay-and-sum beamforming with conventional CMN (i.e., the conventional method). The multiple microphone-array processing using a single decoder needs about one-third of the computational time of the multiple-decoder version without degrading speech recognition performance.
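As a reading aid for the delay-and-sum beamforming mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the per-microphone arrival delays are already known as whole numbers of samples. The function name `delay_and_sum` and the toy signals are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its known integer sample delay, then average.

    channels: list of equal-rate sample lists, one per microphone
    delays:   arrival delay of each microphone in samples (non-negative ints),
              measured relative to the earliest-arriving microphone
    (Assumption: delays are given; estimating them is outside this sketch.)
    """
    # Only keep the span for which every delayed channel still has samples.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    num_mics = len(channels)
    return [
        sum(ch[d + t] for ch, d in zip(channels, delays)) / num_mics
        for t in range(usable)
    ]


# Toy example: the same source reaches the second microphone two samples later.
source = [0.0, 1.0, -2.0, 3.0, 0.5, -1.5]
mic0 = source + [0.0, 0.0]
mic1 = [0.0, 0.0] + source
aligned = delay_and_sum([mic0, mic1], [0, 2])
```

With noise-free copies of the same signal, the aligned average reproduces the source. With real recordings, averaging aligned channels reinforces the speech relative to uncorrelated noise.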

Highlights

  • Automatic speech recognition (ASR) systems are known to perform reasonably well when the speech signals are captured using a close-talking microphone

  • We proposed a robust distant speech recognition system based on position-dependent cepstral mean normalization (CMN) using multiple microphones

  • The speaker position in 3D space could be estimated quickly, and a channel distortion compensation method based on position-dependent CMN was adopted to compensate for the transmission characteristics
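The contrast between conventional CMN and position-dependent CMN can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure. It assumes the a priori compensation parameter is one cepstral vector per pre-measured position, and that the estimated position is mapped to the nearest stored one. The names `conventional_cmn`, `nearest_position`, `position_dependent_cmn`, and the table values are all hypothetical.

```python
def conventional_cmn(frames):
    """Subtract the utterance's own per-dimension cepstral mean from each frame."""
    dims = len(frames[0])
    mean = [sum(f[k] for f in frames) / len(frames) for k in range(dims)]
    return [[f[k] - mean[k] for k in range(dims)] for f in frames]


def nearest_position(estimate, table):
    """Return the stored position closest to the estimate (squared distance)."""
    return min(table, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, estimate)))


def position_dependent_cmn(frames, estimate, table):
    """Subtract the compensation vector stored for the nearest known position.

    Assumption: `table` maps (x, y, z) positions to cepstral compensation
    vectors that were measured before recognition starts.
    """
    comp = table[nearest_position(estimate, table)]
    return [[c - m for c, m in zip(f, comp)] for f in frames]


# Made-up table with two positions and two-dimensional cepstral vectors.
table = {(0.0, 0.0, 1.0): [1.0, 2.0], (2.0, 0.0, 1.0): [0.5, -1.0]}
frames = [[1.5, 2.5], [2.0, 1.0]]
normalized = position_dependent_cmn(frames, (1.8, 0.1, 1.0), table)
```

In this sketch the table lookup needs only the estimated position, whereas `conventional_cmn` needs every frame of the utterance before it can subtract the mean.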

Summary

INTRODUCTION

Automatic speech recognition (ASR) systems are known to perform reasonably well when the speech signals are captured using a close-talking microphone. Performance degrades, however, when the speaker is distant from the microphone. We propose a robust speech recognition method using a new real-time CMN based on speaker position, which we call position-dependent CMN. The system adopts the compensation parameter corresponding to the estimated position, applies a channel distortion compensation method to the speech (i.e., position-dependent CMN), and performs speech recognition. The maximum vote (i.e., the voting method (VM)) or the maximum summation likelihood (i.e., the maximum-summation-likelihood method (MSLM)) over all channels is used to obtain the final result [12]; this is called multiple-decoder processing and is expected to yield robust performance in a distant environment. We also propose multiple microphone-array processing using either multiple decoders or a single decoder. Section 5 describes the experimental results of distant speech recognition in a real environment.
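The two multiple-decoder combination rules named above (VM and MSLM) can be illustrated with a short sketch. It assumes each channel's decoder returns a log-likelihood for every word in the same vocabulary, which may differ from how the paper's decoders report scores. The function names, the tie-breaking rule, and the toy scores are hypothetical.

```python
from collections import Counter


def voting_method(channel_results):
    """Return the word that is the top hypothesis on the most channels.

    channel_results: one dict per channel mapping word -> log-likelihood.
    Ties go to the word seen first across channels (an arbitrary choice
    made for this sketch).
    """
    best_per_channel = [max(scores, key=scores.get) for scores in channel_results]
    return Counter(best_per_channel).most_common(1)[0][0]


def max_summation_likelihood(channel_results):
    """Return the word whose log-likelihood summed over channels is largest.

    Assumption: every channel scores the same set of words.
    """
    totals = {}
    for scores in channel_results:
        for word, loglik in scores.items():
            totals[word] = totals.get(word, 0.0) + loglik
    return max(totals, key=totals.get)


# Toy scores for three channels and a two-word vocabulary.
results = [
    {"left": -10.0, "right": -11.0},
    {"left": -10.0, "right": -10.5},
    {"left": -20.0, "right": -12.0},
]
vm_choice = voting_method(results)
mslm_choice = max_summation_likelihood(results)
```

The toy scores are chosen so that the two rules disagree. Two of three channels prefer "left", so voting selects it, while the third channel's strong score for "right" dominates the summed likelihood.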

SPEAKER POSITION ESTIMATION
Conventional CMN and real-time CMN
Incorporate speaker position information into real-time CMN
Problem and solution
Multiple-decoder processing
Voting method
Maximum-summation-likelihood method
Single-decoder processing
Multiple microphone-array processing
Experimental setup
Recognition experiment for speech emitted by a loudspeaker
Recognition experiment of speech uttered by humans
Experimental results for multiple-microphone speech processing
Findings
CONCLUSION AND FUTURE WORK
