Abstract

In a previous publication, the potential for improving speech recognition by increasing frequency resolution within each channel of a cochlear implant speech processing algorithm was investigated as an alternative to increasing the number of channels (Throckmorton et al., 2006). The study was conducted with normal-hearing subjects listening to speech tokens processed through an acoustic model of the speech processing algorithm. The results suggested that with as few as two discrete frequencies per channel, an increase in speech recognition performance was possible. Although several caveats must be considered and are discussed at length by Throckmorton et al. (2006), these results suggest that cochlear implant recipients might benefit from increased frequency resolution within channels when increasing the number of channels is not an option.
In the acoustic model, rather than presenting a constant frequency for each channel as is traditionally done (e.g., Dorman et al., 1997), the frequency content of each channel was estimated and a presentation frequency was chosen based on this estimate. For computational simplicity, a short-time Fourier transform (STFT) was used to analyze each 2-ms window of speech and select the closest of N predefined presentation frequencies in each channel. However, given the relatively low frequency content in some of the channels, a 2-ms window does not sample enough of the signal to provide an accurate estimate of the frequency spectrum. The authors originally speculated that this lack of accuracy would be less of an issue for implementation in cochlear implants, because stimulation rate cannot be matched to an exact frequency, but that it may have influenced the results of the normal-hearing study (Throckmorton et al., 2006).
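The window-length limitation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 16 kHz sampling rate, the 200 to 400 Hz channel edges, the Hann window, the two carrier values, and all function names are assumptions chosen only for illustration. The sketch picks the strongest FFT bin inside a channel and maps it to the nearest predefined carrier.

```python
import numpy as np

def bin_spacing_hz(fs, n_samples):
    """Spacing between adjacent FFT bins for a frame of n_samples."""
    return fs / n_samples

def estimate_peak_hz(frame, fs, band):
    """Frequency of the strongest FFT bin inside band = (lo, hi) in Hz.

    Returns None when no FFT bin falls inside the band, which is what
    happens for narrow low-frequency channels and very short frames.
    """
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    idx = np.flatnonzero((freqs >= band[0]) & (freqs <= band[1]))
    if idx.size == 0:
        return None
    return float(freqs[idx[np.argmax(magnitude[idx])]])

def nearest_carrier(freq_hz, carriers_hz):
    """Closest of the predefined presentation frequencies."""
    carriers = np.asarray(carriers_hz, dtype=float)
    return float(carriers[np.argmin(np.abs(carriers - freq_hz))])

# Hypothetical values, not taken from the study.
fs = 16000                    # assumed sampling rate in Hz
band = (200.0, 400.0)         # assumed low-frequency channel edges
carriers = [250.0, 350.0]     # assumed two carriers for this channel

short = np.cos(2 * np.pi * 320 * np.arange(32) / fs)    # 2 ms frame
long_ = np.cos(2 * np.pi * 320 * np.arange(512) / fs)   # 32 ms frame

print(bin_spacing_hz(fs, 32), estimate_peak_hz(short, fs, band))
print(bin_spacing_hz(fs, 512), estimate_peak_hz(long_, fs, band))
```

A 2-ms frame gives a bin spacing of 1/0.002 s = 500 Hz regardless of sampling rate, so under these assumptions no bin lands inside a 200 Hz wide low-frequency channel, whereas a longer frame resolves the tone well enough to choose between the two carriers.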
Therefore, the authors have repeated a subset of the previous study using two alternative frequency estimation techniques in addition to the STFT. Only MCFA-2 (Multiple Carrier Frequency Algorithm with 2 predefined carrier frequencies per channel) was tested, since the increase from one to two available presentation frequencies per channel had the greatest impact on speech recognition. The frequency was estimated for each channel using three techniques: the STFT as originally proposed, an overlapping FFT with channel-dependent window durations (see Table I), and a Flanagan phase vocoder (Flanagan and Golden, 1966). In addition to these estimation techniques, a random selection strategy was used for comparison.
TABLE I Duration of windows for overlapping FFT for each channel.
Ten audiometrically normal subjects were recruited from the Duke University staff and student population. Only vowel tokens (‘h/V/d’ format) were tested, since enhancing frequency resolution within channels had the greatest effect on vowel recognition in the previous study. Tokens were presented in quiet as well as with speech-shaped noise added at three signal-to-noise ratios (SNRs): 5, 0, and −5 dB.
The results of the vowel recognition task for MCFA-2 are shown in Fig. 1. The bar graphs are grouped by noise condition, and from darker to lighter shades they present the results for the Flanagan phase vocoder, the overlapping FFT, the STFT, and the random strategy, respectively. The 95% confidence interval for each score is indicated by the error bars on the bar plots. The performance with the STFT was significantly lower (p < 0.05) than the performance with the Flanagan phase vocoder and the overlapping FFT at 5 and −5 dB SNR, respectively. However, neither the phase vocoder nor the overlapping FFT consistently provided improved performance, indicating an inconsistent benefit from increased frequency estimation accuracy.
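As a rough illustration of how a phase-based estimate differs from picking an FFT bin, the sketch below derives frequency from the rate of change of phase, which is the core idea behind phase-vocoder analysis. It is a simplified stand-in and not the Flanagan and Golden (1966) vocoder or the authors' code: it uses an FFT-based analytic signal rather than a per-channel filter bank, and the sampling rate, carriers, and function names are assumptions. A random selection baseline is included for contrast.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (zero out negative frequencies)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def instantaneous_frequency_hz(x, fs):
    """Per-sample frequency from the derivative of the unwrapped phase."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)

def select_carrier(x, fs, carriers_hz):
    """Carrier closest to the median instantaneous frequency of x."""
    estimate = np.median(instantaneous_frequency_hz(x, fs))
    carriers = np.asarray(carriers_hz, dtype=float)
    return float(carriers[np.argmin(np.abs(carriers - estimate))])

def random_carrier(carriers_hz, rng):
    """Baseline: ignore the signal and pick a carrier at random."""
    return float(rng.choice(np.asarray(carriers_hz, dtype=float)))

# Hypothetical values, not taken from the study.
fs = 16000
carriers = [250.0, 350.0]
tone = np.cos(2 * np.pi * 320 * np.arange(800) / fs)  # band-limited test tone

print(select_carrier(tone, fs, carriers))
print(random_carrier(carriers, np.random.default_rng(0)))
```

The phase-based estimate is not tied to the FFT bin grid, but this sketch assumes the input is already band-limited to a single channel and contains one dominant component.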
The significantly poorer performance (p < 0.05) with the random strategy at all noise levels indicates that the benefits of additional frequency resolution are only relevant if the selected frequencies represent the frequency content of the signal. Although the STFT was essentially a random strategy at the lowest channels, its ability to provide reasonable frequency estimates for the channels containing the formants resulted in significant performance gains over the random strategy (p < 0.05). Given that the more computationally complex algorithms investigated here provided limited, inconsistent benefit for normal-hearing subjects, such approaches to frequency estimation may not be required for implementation in cochlear implant speech processing algorithms.
Fig. 1 Vowel recognition for the frequency estimation techniques at four different noise levels. Error bars indicate a 95% confidence interval.
