AV-CrossNet: An Audiovisual Complex Spectral Mapping Network for Speech Separation by Leveraging Narrow- and Cross-Band Modeling
- Research Article
282
- 10.1109/taslp.2019.2955276
- Nov 22, 2019
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Phase is important for the perceptual quality of speech. However, it seems intractable to directly estimate phase spectra through supervised learning, due to their lack of spectrotemporal structure. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances the magnitude and phase responses of speech. Inspired by multi-task learning, we propose a gated convolutional recurrent network (GCRN) for complex spectral mapping, which amounts to a causal system for monaural speech enhancement. Our experimental results suggest that the proposed GCRN substantially outperforms an existing convolutional neural network (CNN) for complex spectral mapping in terms of both objective speech intelligibility and quality. Moreover, the proposed approach yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking. We also find that complex spectral mapping with the proposed GCRN provides an effective phase estimate.
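The core idea described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the network input is the real/imaginary (RI) STFT spectrogram of noisy speech, the training target is the clean RI spectrogram, and the loss is a mean squared error over both components, so magnitude and phase are optimized jointly. The function names and frame parameters here are assumptions for illustration.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Hann-windowed STFT; returns a complex (frames, bins) array."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def ri_targets(clean, noisy):
    """Training pair for complex spectral mapping: noisy RI spectrogram
    as input features, clean RI spectrogram as the regression target."""
    S, Y = stft(clean), stft(noisy)
    target = np.stack([S.real, S.imag], axis=-1)
    feats = np.stack([Y.real, Y.imag], axis=-1)
    return feats, target

def ri_mse(pred, target):
    """MSE over real and imaginary parts, so an estimator that minimizes
    this loss enhances magnitude and phase simultaneously."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(4096) / 16000)
noisy = clean + 0.1 * rng.standard_normal(4096)
x, t = ri_targets(clean, noisy)
```

A perfect estimator would drive `ri_mse` to zero; the noisy input itself does not, which is the gap the GCRN is trained to close.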
- Conference Article
143
- 10.1109/icassp.2019.8682834
- May 1, 2019
Phase is important for perceptual quality in speech enhancement. However, it seems intractable to directly estimate the phase spectrogram through supervised learning, due to the lack of clear structure in the phase spectrogram. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances the magnitude and phase responses of noisy speech. In this paper, we propose a new convolutional recurrent network (CRN) for complex spectral mapping, which leads to a causal system for noise- and speaker-independent speech enhancement. In terms of objective intelligibility and perceptual quality, the proposed CRN significantly outperforms an existing convolutional neural network (CNN) for complex spectral mapping, as well as a strong CRN for magnitude spectral mapping. We additionally incorporate a newly developed group strategy to substantially reduce the number of trainable parameters and the computational cost without sacrificing performance.
- Conference Article
4
- 10.1109/icassp43922.2022.9747237
- May 23, 2022
Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlapping speech separation and recognition tasks. In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06% absolute (8.77% relative).
- Research Article
22
- 10.1016/j.neucom.2021.08.130
- Sep 6, 2021
- Neurocomputing
Spectral mapping with adversarial learning for unsupervised hyperspectral change detection
- Research Article
57
- 10.1109/taslp.2022.3145319
- Jan 1, 2022
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
As the most widely-used spatial filtering approach for multi-channel speech separation, beamforming extracts the target speech signal arriving from a specific direction. An emerging alternative approach is multi-channel complex spectral mapping, which trains a deep neural network (DNN) to directly estimate the real and imaginary spectrograms of the target speech signal from those of the multi-channel noisy mixture. In this all-neural approach, the trained DNN itself becomes a nonlinear, time-varying spectrospatial filter. However, it remains unclear how this approach performs relative to commonly-used beamforming techniques on different array configurations and acoustic environments. This paper is devoted to examining this issue in a systematic way. Comprehensive evaluations show that multi-channel complex spectral mapping achieves separation performance comparable to or better than beamforming for different array geometries and speech separation tasks and reduces to monaural complex spectral mapping in single-channel conditions, demonstrating the general utility of this approach on multi-channel and single-channel speech separation. In addition, such an approach is computationally more efficient than widely-used mask-based beamforming. We conclude that this neural spectrospatial filter provides a strong alternative to traditional and mask-based beamforming.
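The input construction for this all-neural spectrospatial filter can be sketched as follows. This is an illustrative assumption about the feature layout, not the paper's code: the DNN input stacks the real and imaginary STFT coefficients of every microphone, and with a single microphone the same construction reduces to the monaural complex-spectral-mapping input.

```python
import numpy as np

def stack_multichannel_ri(stfts):
    """stfts: list of complex (frames, bins) arrays, one per microphone.
    Returns a real-valued (frames, bins, 2 * n_mics) feature tensor that a
    DNN can map to the RI spectrogram of the target speech signal."""
    parts = []
    for Y in stfts:
        parts.extend([Y.real, Y.imag])
    return np.stack(parts, axis=-1)

rng = np.random.default_rng(0)
# Four simulated microphone spectrograms, 50 frames x 129 frequency bins.
mics = [rng.standard_normal((50, 129)) + 1j * rng.standard_normal((50, 129))
        for _ in range(4)]
feats = stack_multichannel_ri(mics)      # all four channels
mono = stack_multichannel_ri(mics[:1])   # single-channel special case
```

With four microphones the feature tensor has 8 channels; with one microphone it has 2, matching the monaural case the abstract mentions.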
- Research Article
90
- 10.1080/014311699213622
- Jan 1, 1999
- International Journal of Remote Sensing
Imaging spectrometers have the potential to identify surface mineralogy based on the unique absorption features in pixel spectra. A back-propagation neural network (BPN) is introduced to classify Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data of the Cuprite mining district (Nevada) into mineral maps. The results are compared with surface mineralogy maps traditionally acquired from spectral angle mapping (SAM). There is no misclassification for the training set in the case of BPN, whereas 17% misclassification occurs with SAM. The validation accuracy of SAM is 69%, whereas BPN achieves 86% accuracy. The calibration accuracy of the BPN is higher than that of the SAM, suggesting that the training process of BPN is better than that of the SAM. The high classification accuracy obtained with the BPN can be explained by: (1) its ability to deal with complex relationships (e.g., 40 dimensions) and (2) the nature of the dataset: the minerals are highly concentrated and mostly represented by pure pixels. This paper demonstrates that BPN has superior classification ability when applied to imaging spectrometer data.
- Research Article
32
- 10.1002/anie.201800624
- May 7, 2018
- Angewandte Chemie International Edition
A concise insight into the outputs provided by the latest prototype of visible-near infrared (Vis-NIR) multispectral scanner (National Research Council-National Institute of Optics, CNR-INO, Italy) is presented. The analytical data acquired on an oil painting Madonna of the Rabbit by É. Manet are described. In this work, the Vis-NIR was complemented with X-ray fluorescence (XRF) mapping for the chemical and spatial characterization of several pigments. The spatially registered Vis-NIR data facilitated their processing by spectral correlation mapping (SCM) and artificial neural network (ANN) algorithm, respectively, for pigment mapping and improved visibility of pentimenti and of underdrawing style. The data provided several key elements for the comparison with a homonymous original work by Titian studied within the ARCHive LABoratory (ARCHLAB) transnational access project.
- Conference Article
5
- 10.1109/igarss.2011.6049458
- Jul 1, 2011
This study examined rock and alteration mineral mapping using HyMap hyperspectral data over the Warmbad district in southern Namibia. The classification results, obtained with the proposed methods (PPI endmembers derived from the Spectral Hourglass workflow together with field spectra, combined with neural network classification), show good performance for mapping various sericites in pegmatite veins, propylitic alteration minerals (chlorite and epidote), and amphibole in mafic and ultramafic rocks (gabbro and gabbro norite), which are widely observed in the Warmbad district.
- Research Article
9
- 10.1002/ange.201800624
- May 7, 2018
- Angewandte Chemie
A concise insight into the outputs provided by the latest prototype of visible-near infrared (Vis-NIR) multispectral scanner (National Research Council-National Institute of Optics, CNR-INO, Italy) is presented. The analytical data acquired on an oil painting Madonna of the Rabbit by É. Manet are described. In this work, the Vis-NIR was complemented with X-ray fluorescence (XRF) mapping for the chemical and spatial characterization of several pigments. The spatially registered Vis-NIR data facilitated their processing by spectral correlation mapping (SCM) and artificial neural network (ANN) algorithm, respectively, for pigment mapping and improved visibility of pentimenti and of underdrawing style. The data provided several key elements for the comparison with a homonymous original work by Titian studied within the ARCHive LABoratory (ARCHLAB) transnational access project.
- Research Article
20
- 10.1002/ecjb.10065
- Jul 9, 2002
- Electronics and Communications in Japan (Part II: Electronics)
In this paper, a method for generating broadband speech from band-limited speech using spectral linear mapping is proposed. This method is based on LPC analysis and synthesis. It first extracts sound source information (residual waveforms) and vocal tract information (spectral envelopes) from input speech, performs linear mapping of the vocal tract information and nonlinear processing of the sound source information for broadband conversion, and finally generates broadband speech from both by LPC synthesis. The spectral envelope is made broadband by linear mapping: the spectral space is divided into a number of subspaces, and a narrowband spectrum is converted into a broadband spectrum by the conversion matrix of each subspace. Each conversion matrix is estimated using training speech such that the mean square error between the post-conversion spectrum and the target broadband spectrum is minimized. Spectral distortions of the proposed method were compared experimentally with codebook mapping and neural network methods, and it has been verified that the proposed method with linear mapping gives performance not inferior to that of the other two methods. In addition, subjective evaluation tests verified that the proposed method conveys an impression of increased bandwidth. © 2002 Wiley Periodicals, Inc. Electron Comm Jpn Pt 2, 85(8): 44–53, 2002.
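The conversion-matrix estimation described above is ordinary least squares, which can be sketched in NumPy. This is a toy with synthetic envelopes (dimensions, names, and data are illustrative assumptions, not the paper's setup): a matrix W maps narrowband spectral envelopes to broadband ones, and W is chosen to minimize the mean square error over training pairs, which is exactly what `np.linalg.lstsq` computes in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, nb_dim, bb_dim = 500, 10, 20   # illustrative sizes

# Synthetic "ground truth" mapping and training pairs of
# (narrowband envelope X, broadband envelope Y) with slight noise.
W_true = rng.standard_normal((bb_dim, nb_dim))
X = rng.standard_normal((n_train, nb_dim))
Y = X @ W_true.T + 0.01 * rng.standard_normal((n_train, bb_dim))

# MSE-minimizing conversion matrix (one per subspace in the paper;
# a single subspace shown here), via least squares.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
W_hat = W_hat.T  # shape (bb_dim, nb_dim)

def convert(x_nb):
    """Map one narrowband envelope to a broadband envelope."""
    return W_hat @ x_nb

train_mse = float(np.mean((X @ W_hat.T - Y) ** 2))
```

In the paper this estimation is done separately per subspace of the divided spectral space; the sketch shows a single subspace.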
- Conference Article
8
- 10.1109/slt.2018.8639608
- Dec 1, 2018
This paper presents an evaluation of deep spectral mapping and the WaveNet vocoder in voice conversion (VC). In our VC framework, spectral features of an input speaker are converted into those of a target speaker using deep spectral mapping, and then, together with the excitation features, the converted waveform is generated using the WaveNet vocoder. In this work, we compare three different deep spectral mapping networks, i.e., a deep single density network (DSDN), a deep mixture density network (DMDN), and a long short-term memory recurrent neural network with an autoregressive output layer (LSTM-AR). Moreover, we also investigate several methods for reducing mismatches of the spectral features used in the WaveNet vocoder between training and conversion processes, such as methods to alleviate oversmoothing effects of the converted spectral features, and a method to refine WaveNet using the converted spectral features. The experimental results demonstrate that the LSTM-AR yields better spectral mapping accuracy than the others, and the proposed WaveNet refinement method significantly improves the naturalness of the converted waveform.
- Research Article
9
- 10.1109/taslp.2024.3366760
- Jan 1, 2024
- IEEE/ACM transactions on audio, speech, and language processing
The presence of background noise or competing talkers is one of the main challenges for speech understanding by cochlear implant (CI) users in naturalistic spaces. These external factors distort the time-frequency (T-F) content of speech signals, including the magnitude spectrum and phase. While most existing speech enhancement (SE) solutions focus solely on enhancing the magnitude response, recent research highlights the importance of phase in perceptual speech quality. Motivated by multi-task machine learning, this study proposes a deep complex convolution transformer network (DCCTN) for complex spectral mapping, which simultaneously enhances the magnitude and phase responses of speech. The proposed network leverages a complex-valued U-Net structure with a transformer within the bottleneck layer to capture sufficient low-level contextual information in the T-F domain. To capture the harmonic correlation in speech, DCCTN incorporates a frequency transformation block in the encoder of the U-Net architecture. The DCCTN learns a complex transformation matrix to accurately recover speech in the T-F domain from a noisy input spectrogram. Experimental results demonstrate that the proposed DCCTN outperforms existing models such as the convolutional recurrent network (CRN), deep complex convolutional recurrent network (DCCRN), and gated convolutional recurrent network (GCRN) in terms of objective speech intelligibility and quality, for both seen and unseen noise conditions. To evaluate the effectiveness of the proposed SE solution, a formal listener evaluation involving four CI recipients was conducted. Results indicate a significant improvement in speech intelligibility for CI recipients in noisy environments. Additionally, DCCTN demonstrates the capability to suppress highly non-stationary noise without introducing the musical artifacts commonly observed in conventional SE methods.
- Conference Article
31
- 10.1109/slt48900.2021.9383624
- Jan 19, 2021
In this work, we exploit speech enhancement for improving a recurrent neural network transducer (RNN-T) based ASR system. We employ a dense convolutional recurrent network (DCRN) for complex spectral mapping based speech enhancement, and find it helpful for ASR in two ways: a data augmentation technique, and a preprocessing frontend. In using it for ASR data augmentation, we exploit a KL divergence based consistency loss that is computed between the ASR outputs of original and enhanced utterances. In using speech enhancement as an effective ASR frontend, we propose a three-step training scheme based on model pretraining and feature selection. We evaluate our proposed techniques on a challenging social media English video dataset, and achieve an average relative improvement of 11.2% with speech enhancement based data augmentation, 8.3% with enhancement based preprocessing, and 13.4% when combining both.
- Conference Article
2
- 10.1109/globalsip.2016.7905840
- Dec 1, 2016
We present a recurrent neural network (RNN) based speech bandwidth extension (BWE) method. Conventional Gaussian mixture model (GMM) based BWE methods perform stably and effectively. However, GMM based methods suffer from two fundamental and competing problems: 1) the inadequacy of GMMs in modeling the non-linear relationship between the low frequency (LF) and high frequency (HF) bands, and 2) temporal correlations across speech frames are ignored, resulting in loss of spectral detail in the speech reconstructed by BWE. To cope with these problems, an RNN is employed to capture temporal information and construct deep non-linear relationships between the spectral envelope features of LF and HF. The proposed RNN is trained layer-by-layer from a cascade of two recurrent temporal restricted Boltzmann machines (RTRBMs) and a feedforward neural network (NN). The proposed method takes advantage of the strong ability of RTRBMs to discover the temporal correlation between adjacent frames and to model deep non-linear relationships between input and output. Both the objective and subjective evaluations indicate that our proposed method outperforms the conventional GMM based methods and other NN based methods.
- Research Article
- 10.1016/j.neucom.2025.131051
- Nov 1, 2025
- Neurocomputing
Multimodal and multichannel speech separation using location-guided speech feature mapping network