Is it possible and useful to compute STOI from Mel-spectrograms? If so, how?

Answer from top 10 papers

The Short-Time Objective Intelligibility (STOI) measure predicts the intelligibility of speech from time-frequency representations of the clean and degraded signals (Fazeel et al., 2023; Fuh et al., 2020); concretely, the standard algorithm compares short-time temporal envelopes of the two signals in 1/3-octave bands derived from the STFT. The traditional STOI algorithm is therefore not defined on Mel-spectrograms, which apply a different perceptually motivated frequency warping, but the literature suggests that intelligibility prediction methods in the STOI family can be built on a range of time-frequency representations of speech (Jensen & Taal, 2016).
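For reference, standard STOI is computed from time-domain signals rather than from precomputed spectrograms. Below is a minimal sketch using the third-party pystoi package; the signal arrays are random placeholders standing in for real clean and degraded speech.

```python
# Standard STOI from time-domain signals via the third-party pystoi package
# (pip install pystoi). The algorithm internally builds its own STFT-based
# 1/3-octave band representation, so no spectrogram is passed in.
import numpy as np
from pystoi import stoi

fs = 10000                                         # STOI is defined at 10 kHz
clean = np.random.randn(3 * fs)                    # placeholder clean speech
degraded = clean + 0.5 * np.random.randn(3 * fs)   # placeholder degraded speech

score = stoi(clean, degraded, fs, extended=False)  # roughly in [0, 1]
print(f"STOI: {score:.3f}")
```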
Interestingly, one of the retrieved papers transforms Mel-spectrograms into 3D data for underwater acoustic target classification, showing that Mel-spectrograms can be restructured for use in models well beyond their original setting, potentially including advanced intelligibility prediction models. Additionally, Gul et al. (2024) highlight the effectiveness of Mel spectrogram features in speech enhancement for simulated hearing-impaired scenarios, which suggests that Mel-spectrograms carry information relevant to intelligibility prediction.
In summary, STOI is not traditionally computed from Mel-spectrograms, but the literature indicates that Mel-spectrogram features are effective in speech enhancement and are plausibly useful for intelligibility prediction. Two routes seem possible: approximating STOI's 1/3-octave band envelopes from the Mel band energies (an inherently lossy mapping, since the Mel filterbank is generally not invertible), or defining a STOI-like envelope-correlation measure directly on Mel bands. Either adaptation could be useful where perceptual frequency weighting is beneficial, such as hearing-impaired speech enhancement (Gul et al., 2024), but it would no longer be the standard STOI, and it would need to be validated against listening-test data before its scores could be trusted.
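To make the second route concrete, the sketch below defines a hypothetical STOI-like measure directly on Mel bands: it borrows STOI's segment length, per-band normalization, and clipping, but substitutes librosa Mel filterbank envelopes for the prescribed 1/3-octave bands. Every parameter here (band count, FFT size, hop, segment length) is an illustrative assumption, and scores from this function are not comparable to published STOI values without validation.

```python
# Hypothetical STOI-like measure on Mel bands instead of the 1/3-octave bands
# of the original algorithm. Unvalidated sketch: segment length (30 frames),
# normalization, and clipping (beta = -15 dB) are borrowed from STOI.
import numpy as np
import librosa

def mel_stoi_like(clean, degraded, sr, n_mels=15, n_fft=512, hop=256,
                  seg=30, beta_db=-15.0):
    # Mel band envelopes, shape (n_mels, frames)
    X = np.sqrt(librosa.feature.melspectrogram(y=clean, sr=sr, n_fft=n_fft,
                                               hop_length=hop, n_mels=n_mels))
    Y = np.sqrt(librosa.feature.melspectrogram(y=degraded, sr=sr, n_fft=n_fft,
                                               hop_length=hop, n_mels=n_mels))
    corrs = []
    for m in range(X.shape[1] - seg + 1):
        x, y = X[:, m:m + seg], Y[:, m:m + seg]
        # Scale the degraded segment band-by-band, then clip, as STOI does
        alpha = (np.linalg.norm(x, axis=1, keepdims=True)
                 / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-12))
        y = np.minimum(alpha * y, (1 + 10 ** (-beta_db / 20)) * x)
        # Per-band correlation between clean and degraded envelopes
        xc = x - x.mean(axis=1, keepdims=True)
        yc = y - y.mean(axis=1, keepdims=True)
        r = (xc * yc).sum(axis=1) / (np.linalg.norm(xc, axis=1)
                                     * np.linalg.norm(yc, axis=1) + 1e-12)
        corrs.append(r.mean())
    return float(np.mean(corrs))
```

The first route, mapping Mel band energies back to 1/3-octave bands, would instead require a (pseudo-)inverse of the Mel filterbank and is inherently approximate.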

Source Papers

Refinement and validation of the binaural short time objective intelligibility measure for spatially diverse conditions

Speech intelligibility prediction methods have recently gained popularity in the speech processing community as supplements to time-consuming and costly listening experiments. Such methods can be used to objectively quantify and compare the advantage of different speech enhancement algorithms, in a way that correlates well with actual speech intelligibility. One such method is the short-time objective intelligibility (STOI) measure. In a recent publication, we proposed a binaural version of the STOI measure, based on a modified version of the equalization cancellation (EC) model. This measure was shown to retain many of the advantageous properties of the STOI measure, while at the same time being able to predict intelligibility correctly in conditions involving both binaural advantage and non-linear signal processing. The biggest prediction errors were found for conditions involving multiple spatially distributed interferers. In this paper, we report results for a new listening experiment including different mixtures of isotropic and point source noise. This exposes that the binaural STOI measure has a tendency to overestimate the intelligibility in conditions with spatially distributed interferers at low signal-to-noise ratios (SNRs). This condition-dependent error can make it difficult to compare intelligibility across different acoustical conditions. We investigate the cause of this upward bias, and propose a correction which alleviates the problem. The modified method is evaluated with five datasets of measured intelligibility, spanning a wide range of realistic acoustic conditions. Within the tested conditions, the modified method yields very accurate predictions, and entirely alleviates the aforementioned tendency to overestimate intelligibility in conditions with spatially distributed interferers.

Single-channel speech enhancement using colored spectrograms

Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted from the pix2pix generative adversarial network (GAN) and train it over colored spectrograms of speech to denoise them. After denoising, the colors of spectrograms are translated to magnitudes of the short-time Fourier transform (STFT) using a shallow regression neural network. These estimated STFT magnitudes are later combined with the noisy phases to obtain an enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1% in the short-time objective intelligibility (STOI) over the unprocessed noisy data. The gain in quality and intelligibility over the unprocessed signal is almost equal to the gain achieved by the baseline methods used for comparison with the proposed model, but at a much reduced computational cost. The proposed solution offers a comparable PESQ score at almost 10 times lower computational cost than a similar baseline model, trained on grayscale spectrograms, that generated the highest PESQ score, while giving up only 1% in STOI at 28 times lower computational cost compared with another baseline system, based on a convolutional neural network GAN (CNN-GAN), that produces the most intelligible speech.
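The last step of this pipeline, combining estimated STFT magnitudes with the noisy phase, is straightforward to sketch with librosa. The GAN denoiser and the color-to-magnitude regression are out of scope here, so the estimated magnitudes below are a stand-in, and "noisy.wav" is a hypothetical file name.

```python
# Sketch of the reconstruction step: estimated STFT magnitudes are combined
# with the phase of the noisy input, then inverted back to a waveform.
import numpy as np
import librosa

noisy, sr = librosa.load("noisy.wav", sr=None)            # hypothetical input
noisy_stft = librosa.stft(noisy, n_fft=512, hop_length=128)
noisy_phase = np.angle(noisy_stft)

est_mag = np.abs(noisy_stft)   # stand-in for magnitudes predicted by the model
enhanced = librosa.istft(est_mag * np.exp(1j * noisy_phase), hop_length=128)
```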

Experimental Investigation of Acoustic Features to Optimize Intelligibility in Cochlear Implants

Although cochlear implants work well for people with hearing impairment in quiet conditions, it is well-known that they are not as effective in noisy environments. Noise reduction algorithms based on machine learning allied with appropriate speech features can be used to address this problem. The purpose of this study is to investigate the importance of acoustic features in such algorithms. Acoustic features are extracted from speech and noise mixtures and used in conjunction with the ideal binary mask to train a deep neural network to estimate masks for speech synthesis to produce enhanced speech. The intelligibility of this speech is objectively measured using metrics such as Short-time Objective Intelligibility (STOI), Hit Rate minus False Alarm Rate (HIT-FA) and Normalized Covariance Measure (NCM) for both simulated normal-hearing and hearing-impaired scenarios. A wide range of existing features is experimentally evaluated, including features that have not been traditionally applied in this application. The results demonstrate that frequency domain features perform best. In particular, Gammatone features performed best for normal hearing over a range of signal-to-noise ratios and noise types (STOI = 0.7826). Mel spectrogram features exhibited the best overall performance for hearing impairment (NCM = 0.7314). There is a stronger correlation between STOI and NCM than HIT-FA and NCM, suggesting that the former is a better predictor of intelligibility for hearing-impaired listeners. The results of this study may be useful in the design of adaptive intelligibility enhancement systems for cochlear implants based on both the noise level and the nature of the noise (stationary or non-stationary).
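As a sketch of the training target mentioned above, an ideal binary mask keeps each time-frequency unit in which speech dominates noise by more than a local criterion. The 0 dB threshold and plain STFT analysis below are illustrative assumptions; the study itself evaluates Gammatone, Mel, and other feature front ends.

```python
# Ideal binary mask (IBM) sketch: 1 where the local SNR of a TF unit exceeds
# the criterion lc_db, 0 elsewhere. Requires separate speech and noise signals.
import numpy as np
import librosa

def ideal_binary_mask(speech, noise, n_fft=512, hop=128, lc_db=0.0):
    S = np.abs(librosa.stft(speech, n_fft=n_fft, hop_length=hop)) ** 2
    N = np.abs(librosa.stft(noise, n_fft=n_fft, hop_length=hop)) ** 2
    local_snr_db = 10 * np.log10((S + 1e-12) / (N + 1e-12))
    return (local_snr_db > lc_db).astype(np.float32)
```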

Differential treatment for time and frequency dimensions in mel-spectrograms: An efficient 3D Spectrogram network for underwater acoustic target classification

Underwater acoustic target classification (UATC) has traditionally relied on time–frequency (TF) analysis, with the mel-spectrogram being widely used due to its 2D image-like format and size efficiency. However, many previous UATC approaches have not adequately acknowledged the distinction between natural images and mel-spectrograms. In this paper, we propose a novel approach to transform mel-spectrograms into 3D data and introduce an efficient 3D Spectrogram Network (3DSNet) that treats the time and frequency dimensions separately. Our 3DSNet consists of three key components: the Time–Frequency Separate Convolution (TFSConv) module, Asymmetric Pooling (AsyPool) module, and Channel-Time Attention (CTA) module. We introduce the TFSConv module, which utilizes two convolution operators for the time and frequency dimensions to extract 3D T-F features. This module serves as a lightweight approximation of 3D convolution, reducing the parameter count by approximately 85 percent. To maintain the inherent distinction between the frequency and time dimensions, we propose an AsyPool module that employs two downsampling strategies. Additionally, we introduce a CTA module to capture more informative and meaningful T-F features. We evaluate our 3DSNet on two publicly available underwater acoustic datasets, and the results demonstrate that our method achieves the optimal balance between performance and model parameters compared to other mainstream methods.
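The parameter saving from factorizing time and frequency processing can be illustrated with a minimal PyTorch module. This shows only the generic separable-convolution idea, not the paper's actual TFSConv, which operates on 3D spectrogram tensors alongside the AsyPool and CTA modules.

```python
# Generic time-frequency separated convolution: one (k x 1) frequency kernel
# followed by one (1 x k) time kernel instead of a full (k x k) kernel,
# cutting per-layer kernel parameters roughly from k*k to 2*k.
import torch.nn as nn

class SeparatedTFConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.freq_conv = nn.Conv2d(in_ch, out_ch, (k, 1), padding=(k // 2, 0))
        self.time_conv = nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, k // 2))

    def forward(self, x):          # x: (batch, channels, freq, time)
        return self.time_conv(self.freq_conv(x))
```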

An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers

Intelligibility listening tests are necessary during development and evaluation of speech processing algorithms, despite the fact that they are expensive and time-consuming. In this paper, we propose a monaural intelligibility prediction algorithm, which has the potential of replacing some of these listening tests. The proposed algorithm shows similarities to the short-time objective intelligibility (STOI) algorithm, but works for a larger range of input signals. In contrast to STOI, extended STOI (ESTOI) does not assume mutual independence between frequency bands. ESTOI also incorporates spectral correlation by comparing complete 400 ms spectrograms of the noisy/processed speech and the clean speech signals. As a consequence, ESTOI is also able to accurately predict the intelligibility of speech contaminated by temporally highly modulated noise sources in addition to noisy signals processed with time-frequency weighting. We show that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility. A free MATLAB implementation of the algorithm is available for noncommercial use at http://kom.aau.dk/~jje/.
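The authors' MATLAB implementation is linked above; for completeness, the same extended measure is also exposed by the third-party Python package pystoi, mirroring the earlier STOI snippet but with the extended flag set (signals below are placeholders).

```python
# ESTOI via the third-party pystoi package: identical call to STOI,
# with extended=True selecting the spectral-correlation variant.
import numpy as np
from pystoi import stoi

fs = 10000
clean = np.random.randn(3 * fs)                    # placeholder signals
degraded = clean + 0.5 * np.random.randn(3 * fs)
escore = stoi(clean, degraded, fs, extended=True)
print(f"ESTOI: {escore:.3f}")
```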