Abstract
Interaural coherence is used to quantify the effects of reverberation on speech. Previous studies estimated it with a conventional method that uses all previous time data, in the form of an infinite impulse response filter. To account for the fact that speech changes continuously over time, this paper proposes a new method that estimates interaural coherence using only the time data within a finite length of speech, called the quasi-steady interval. The length of the quasi-steady interval was determined through numerical experiments for various frequency bands, reverberation times, and short-time Fourier transform (STFT) variables; it decreased as the reverberation time decreased and as the frequency increased. Within this interval, diffuse speech, which is an infinite sum of reflections arriving along different propagation paths, is uncorrelated between two spatially separated microphones; thus, its coherence is close to zero. In contrast, direct speech measured at the two microphones has a steady amplitude and phase difference within this interval; thus, its coherence is close to one. Moreover, the new method takes the form of a finite impulse response filter with a linear or zero phase delay; thus, the same time delay, or no delay at all, is applied to the power spectral density at every frequency. Therefore, the coherence estimated by the new method is closer to the ideal value than that of the conventional method, and the coherence is accurately estimated at the time–frequency bins of direct speech, which vary over time as the speech varies.
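The contrast between the two estimators can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the forgetting factor `alpha`, the window length `n_frames`, the centered (zero-phase) moving average, and the white-noise test signals are all assumptions chosen for illustration, and the tuned quasi-steady interval lengths from the paper are not reproduced. The sketch smooths the auto- and cross-power spectra either recursively (IIR, all past frames contribute) or over a finite window of frames (FIR), then forms the magnitude coherence.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT, shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def smooth_iir(P, alpha=0.9):
    """Recursive averaging over frames: every past frame contributes."""
    out = np.empty_like(P)
    acc = np.zeros_like(P[0])
    for t in range(len(P)):
        acc = alpha * acc + (1 - alpha) * P[t]
        out[t] = acc
    return out

def smooth_fir(P, n_frames=15):
    """Centered moving average over a finite window (zero-phase assumption).
    The window is truncated at the first and last frames."""
    out = np.empty_like(P)
    half = n_frames // 2
    for t in range(len(P)):
        lo, hi = max(0, t - half), min(len(P), t + half + 1)
        out[t] = P[lo:hi].mean(axis=0)
    return out

def coherence(L, R, smooth):
    """Magnitude coherence from smoothed auto- and cross-power spectra."""
    s_lr = smooth(L * np.conj(R))
    s_ll = smooth(np.abs(L) ** 2)
    s_rr = smooth(np.abs(R) ** 2)
    return np.abs(s_lr) / np.sqrt(s_ll * s_rr + 1e-12)

# Hypothetical test signals: white noise stands in for speech.
rng = np.random.default_rng(0)
src = rng.standard_normal(16000)
# "Direct" case: same source at both ears with a 2-sample interaural delay.
direct_l, direct_r = stft(src), stft(np.roll(src, 2))
# "Diffuse" case: independent signals at the two ears.
diffuse_l = stft(rng.standard_normal(16000))
diffuse_r = stft(rng.standard_normal(16000))

coh_direct_fir = coherence(direct_l, direct_r, smooth_fir)
coh_diffuse_fir = coherence(diffuse_l, diffuse_r, smooth_fir)
coh_direct_iir = coherence(direct_l, direct_r, smooth_iir)
coh_diffuse_iir = coherence(diffuse_l, diffuse_r, smooth_iir)

print("FIR  direct:", coh_direct_fir.mean(), " diffuse:", coh_diffuse_fir.mean())
print("IIR  direct:", coh_direct_iir.mean(), " diffuse:", coh_diffuse_iir.mean())
```

With these stationary toy signals both estimators separate the two cases, so the sketch only shows the mechanics. The advantage described in the abstract concerns time-varying speech, where the finite window avoids carrying over power from frames outside the quasi-steady interval.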
Highlights
In a real situation, influences of the surrounding environment, such as multiple speakers, external noise sources, and reverberation, distort the target speech information so that the performance of speech recognition methods degrades [1].
The binaural room impulse responses were taken from the Aachen Impulse Response (AIR) database [17], which contains impulse responses recorded with a KEMAR dummy head in various reverberant environments and at various azimuths.
The optimal LT and β were determined as functions of gender, frequency, and reverberation time, but the results showed that gender does not affect the optimal parameters.
Summary
Influences of the surrounding environment, such as multiple speakers, external noise sources, and reverberation, distort the target speech information so that the performance of speech recognition methods degrades [1]. Among the causes of speech distortion, multiple speakers and external noise sources are additive noises, which have a low correlation with the target speech, making it easy to extract information about the target speech. Reverberation, on the other hand, is convolutive noise caused by sound waves reflected from the surrounding walls or objects. Reverberant speech contains both direct speech and reverberation, which is a delayed and attenuated copy of the direct speech. The performance of speech separation decreases because the reverberation changes the amplitude and phase of the time–frequency bins of direct speech [2,3]; speech recognition performance is degraded.