Abstract

Traditional stereophonic acoustic echo cancellation (SAEC) algorithms need to estimate the acoustic echo paths from the stereo loudspeakers to a microphone, which often suffers from the nonuniqueness problem caused by the high correlation between the two far-end signals of these stereo loudspeakers. Many decorrelation methods have been proposed to mitigate this problem; however, they may degrade the audio quality and/or the stereophonic spatial perception. This paper proposes a convolutional recurrent network (CRN) that suppresses the stereophonic echo components by estimating a nonlinear gain, which is then multiplied by the complex spectrum of the microphone signal to obtain the estimated near-end speech without any decorrelation procedure. The CRN consists of an encoder-decoder module and a two-layer gated recurrent network module, so it can exploit the feature extraction capability of convolutional neural networks and the temporal modeling capability of recurrent neural networks simultaneously. The magnitude spectra of the two far-end signals are used directly as input features without any decorrelation preprocessing, and thus both the audio quality and the stereophonic spatial perception can be maintained. Experimental results in both simulated and real acoustic environments show that the proposed algorithm outperforms traditional algorithms such as the normalized least-mean-square (NLMS) and Wiener algorithms, especially at low signal-to-echo ratios and high reverberation times (RT60).
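The core operation described in the abstract — multiplying an estimated real-valued gain by the complex spectrum of the microphone signal — can be sketched in a few lines. The sketch below uses a random toy spectrogram and random gain values as stand-ins for an STFT and the CRN's output; only the masking step itself reflects the paper's method.

```python
import numpy as np

def apply_gain_mask(mic_stft, gain):
    """Scale the complex microphone spectrum by a real-valued gain
    per time-frequency bin: magnitudes shrink, phases are kept."""
    return gain * mic_stft

rng = np.random.default_rng(0)
# toy complex spectrogram: 161 frequency bins x 10 time frames
mic = rng.standard_normal((161, 10)) + 1j * rng.standard_normal((161, 10))
# stand-in for the CRN's estimated gain (strictly positive here)
gain = rng.uniform(0.1, 1.0, size=(161, 10))

est = apply_gain_mask(mic, gain)
# magnitude is scaled by the gain, phase is unchanged
assert np.allclose(np.abs(est), gain * np.abs(mic))
assert np.allclose(np.angle(est), np.angle(mic))
```

Because the gain is real and positive, the enhanced signal reuses the microphone phase; the near-end waveform would then be recovered by an inverse STFT.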

Highlights

  • In practical hands-free teleconferencing systems, a stereophonic communication system is often necessary to provide a realistic experience that a single-channel system cannot offer

  • This paper proposes to use a convolutional recurrent network (CRN) to suppress the stereophonic echo components by estimating a nonlinear gain, which is multiplied by the complex spectrum of the microphone signal to obtain the estimated near-end speech without a decorrelation procedure

  • A deep learning approach is proposed to solve the SAEC problem because deep neural networks (DNNs) can efficiently model the nonlinear relationships between high-dimensional vectors, which removes the need for both the decorrelation procedure and the double-talk detectors (DTDs) used in traditional SAEC algorithms


Summary

INTRODUCTION

In practical hands-free teleconferencing systems, a stereophonic communication system is often necessary to provide a realistic experience that a single-channel system cannot offer. Traditional SAEC methods usually suppress the echo by estimating the acoustic echo paths between the stereophonic loudspeakers and the microphones using adaptive filters. In such a case, two echo paths need to be identified for each microphone because there are two far-end signals. A commonly used way to mitigate the nonuniqueness problem in adaptive filtering-based SAEC algorithms is to decorrelate the two far-end signals (Benesty et al., 1999). A modified stereophonic acoustic echo suppression (SAES) method incorporating the spectral and temporal correlations in the STFT domain was proposed by Lee et al. (2014), which considered the adjacent time-frequency (TF) bins of the far-end signals. Note that these Wiener filter-based SAES algorithms suppress the echo signal directly through an estimated gain function in each TF bin; they do not need to estimate the two echo paths of SAEC exactly.

SIGNAL MODEL
PROPOSED CRN-BASED SAES
Feature extraction
Training targets and signal reconstruction
Model architecture
Experiment setting
Encoder feature-map dimensions per layer: 16 × T × 80, 32 × T × 39, 64 × T × 19, 128 × T × 9, 256 × T × 4
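The frequency dimensions in the feature-map sizes above (80, 39, 19, 9, 4) are consistent with a five-layer convolutional encoder using a frequency kernel of 3 and stride of 2 on a 161-bin STFT input — a common CRN configuration, though the exact hyperparameters are an assumption here. The standard convolution output-size formula reproduces the listed sizes:

```python
def conv_out(size, kernel=3, stride=2, pad=0):
    """Standard convolution output-size formula along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

freq = 161  # assumed number of STFT frequency bins at the encoder input
dims = []
for channels in (16, 32, 64, 128, 256):  # channels per encoder layer
    freq = conv_out(freq)
    dims.append((channels, freq))

print(dims)  # [(16, 80), (32, 39), (64, 19), (128, 9), (256, 4)]
```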
Performance evaluation in high SNR scenarios
Performance evaluation in low SNR scenarios
CONCLUSION