Multi-channel speech enhancement plays a critical role in numerous speech-related applications. Several previous works explicitly use deep neural networks (DNNs) to exploit tempo-spectral signal characteristics, which often leads to excellent performance. In this work, we present a time-frequency fusion model, termed TFFM, for multi-channel speech enhancement. We use three cascaded U-Nets to capture three types of high-resolution features and to investigate their individual contributions. Specifically, the first U-Net preserves the time dimension and performs feature extraction along the frequency dimension, yielding high-resolution spectral features with global temporal information; the second U-Net preserves the frequency dimension and extracts features along the time dimension, yielding high-resolution temporal features with global spectral information; and the third U-Net downsamples and upsamples along both the frequency and time dimensions, yielding high-resolution tempo-spectral features. Together, the three cascaded U-Nets aggregate local and global features and thereby effectively model the tempo-spectral structure of speech signals. The proposed TFFM outperforms state-of-the-art baselines.
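For intuition, the sketch below shows one way such a cascade could be wired in PyTorch: three single-level U-Nets that differ only in the axes along which their strided down-/up-sampling operates (frequency only, time only, then both). All names, channel widths, depths, and kernel sizes here are illustrative assumptions, not the paper's configuration, which is given only in the full text.

```python
# Minimal sketch of the three-U-Net cascade described in the abstract.
# Widths, depths, and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    """Two 3x3 convolutions over (freq, time) feature maps."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class AxisUNet(nn.Module):
    """Single-level U-Net whose down/up-sampling acts only on chosen axes.

    stride=(2, 1) pools the frequency axis only (time kept at full
    resolution), stride=(1, 2) pools time only, stride=(2, 2) pools both.
    """

    def __init__(self, channels, stride):
        super().__init__()
        self.enc = conv_block(channels, channels)
        self.down = nn.Conv2d(channels, channels, kernel_size=stride, stride=stride)
        self.bottleneck = conv_block(channels, channels)
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=stride, stride=stride)
        self.dec = conv_block(2 * channels, channels)

    def forward(self, x):
        skip = self.enc(x)                    # full-resolution local features
        z = self.bottleneck(self.down(skip))  # wider context along pooled axes
        z = self.up(z)                        # restore original resolution
        return self.dec(torch.cat([skip, z], dim=1))  # skip connection


class TFFMSketch(nn.Module):
    """Cascade of the three U-Nets: frequency-only, time-only, then both."""

    def __init__(self, in_channels=2, channels=32):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, channels, kernel_size=1)
        self.freq_unet = AxisUNet(channels, stride=(2, 1))  # pool frequency
        self.time_unet = AxisUNet(channels, stride=(1, 2))  # pool time
        self.tf_unet = AxisUNet(channels, stride=(2, 2))    # pool both
        self.out = nn.Conv2d(channels, 1, kernel_size=1)    # e.g. a mask

    def forward(self, x):  # x: (batch, mic_channels, freq, time)
        h = self.proj(x)
        h = self.freq_unet(h)
        h = self.time_unet(h)
        h = self.tf_unet(h)
        return self.out(h)


if __name__ == "__main__":
    spec = torch.randn(1, 2, 256, 128)  # (batch, mics, freq bins, frames)
    print(TFFMSketch()(spec).shape)     # torch.Size([1, 1, 256, 128])
```

In this reading, cascading the three variants lets later stages operate on features that already mix full-resolution detail along one axis with pooled context along the other, which matches the abstract's stated goal of aggregating local and global tempo-spectral information.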