Single-channel speech enhancement has been studied extensively in the literature, for applications such as speech communication systems. However, in emerging applications such as virtual reality and spatial audio, in addition to attenuating undesired signals, it is important to preserve the spatial information of the desired signal captured in a noisy environment. Nevertheless, only a few studies in the literature propose solutions to this challenge. Most of these attenuate the undesired signals while preserving only limited spatial information about the desired signal, such as its direction of arrival (DOA). Methods that preserve complete spatial information have been suggested only recently and have not been studied comprehensively. In this paper, two such methods based on time-frequency masking are investigated, with the aim of attenuating the undesired signal while preserving the spatial components of the desired signal. The first, referred to as spatial masking, is based on masking in the plane wave density (PWD) domain; the second is based on masking in the spherical harmonics (SH) domain. Both methods are compared with a reference method based on beamforming followed by single-channel time-frequency masking. Objective analysis and two listening tests were conducted to evaluate the performance of these methods for speech enhancement. The results show that the spatial masking based method better preserves the desired component of the sound field, whereas the performance of the SH based method depends more strongly on the sources' distances. Conversely, the SH based method better preserves the DOA of the residual noise, while under the spatial masking based method the DOA of the residual noise is strongly affected by the undesired signal.
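To make the SH-domain masking idea concrete, the following is a minimal sketch, not the authors' implementation: a single real-valued time-frequency mask is applied identically to every SH coefficient channel of the STFT, so the relative spatial encoding of the retained components is preserved. The helper name `apply_tf_mask_sh` and the Wiener-style oracle mask built from synthetic magnitudes are assumptions for illustration only; the paper's own mask estimators are not specified here.

```python
import numpy as np

def apply_tf_mask_sh(sh_stft, mask):
    """Apply one (freq, time) mask to all SH channels (hypothetical helper).

    sh_stft : complex array, shape (channels, freq, time)
        STFT of the spherical-harmonics coefficient signals.
    mask : real array, shape (freq, time), values in [0, 1]
        Time-frequency gain shared across channels, so the ratios
        between SH channels (i.e., the spatial encoding) are unchanged.
    """
    return sh_stft * mask[np.newaxis, :, :]

# Toy data: 4 first-order SH channels, 128 frequency bins, 50 frames.
rng = np.random.default_rng(0)
sh_stft = (rng.standard_normal((4, 128, 50))
           + 1j * rng.standard_normal((4, 128, 50)))

# Generic Wiener-style mask from assumed desired/noise magnitudes
# (an illustrative oracle, not the estimator used in the paper).
desired_mag = np.abs(rng.standard_normal((128, 50)))
noise_mag = np.abs(rng.standard_normal((128, 50)))
mask = desired_mag**2 / (desired_mag**2 + noise_mag**2 + 1e-12)

enhanced = apply_tf_mask_sh(sh_stft, mask)
print(enhanced.shape)  # same (channels, freq, time) layout as the input
```

The PWD-domain variant described in the abstract differs in where the mask is applied: the sound field is first transformed to a grid of plane-wave directions, masked there, and transformed back, which allows direction-dependent gains rather than one shared gain per time-frequency bin.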