Abstract

One criterion in the design of binaural sound scenes in audio production is the extent to which the intended speech message is correctly understood. Object-based audio broadcasting systems have permitted sound editors to gain more access to the metadata (e.g., intensity and location) of each sound source, providing better control over speech intelligibility. The current study describes and evaluates a binaural distortion-weighted glimpse proportion metric (BiDWGP), which is motivated by better-ear glimpsing and binaural masking level differences. BiDWGP predicts intelligibility from two alternative input forms: either binaural recordings or monophonic recordings from each sound source along with their locations. Two listening experiments were performed with stationary noise and competing speech, one in the presence of a single masker, the other with multiple maskers, for a variety of spatial configurations. Overall, BiDWGP with both input forms predicts listener keyword scores with correlations of 0.95 and 0.91 for single- and multi-masker conditions, respectively. When considering masker type separately, correlations rise to 0.95 and above for both types of maskers. Predictions using the two input forms are very similar, suggesting that BiDWGP can be applied to the design of sound scenes where only individual sound sources and their locations are available.
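The better-ear glimpsing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the BiDWGP metric itself: it omits the distortion weighting, the binaural masking level difference component, and any auditory front end. The function name `better_ear_glimpse_proportion`, the input layout (per-ear time-frequency levels in dB), and the 3 dB local SNR threshold are all assumptions made for illustration only.

```python
import numpy as np


def better_ear_glimpse_proportion(speech_db, masker_db, threshold_db=3.0):
    """Fraction of time-frequency cells glimpsed at the better ear.

    Hypothetical helper, not the published BiDWGP computation.
    speech_db, masker_db: arrays of shape (2, n_bands, n_frames) holding
    levels in dB for the left (index 0) and right (index 1) ear.
    threshold_db: assumed local SNR criterion for counting a glimpse.
    """
    speech_db = np.asarray(speech_db, dtype=float)
    masker_db = np.asarray(masker_db, dtype=float)
    if speech_db.shape != masker_db.shape:
        raise ValueError("speech and masker arrays must have the same shape")
    if speech_db.ndim != 3 or speech_db.shape[0] != 2:
        raise ValueError("expected shape (2, n_bands, n_frames)")

    # Local SNR in each cell, computed separately for each ear.
    local_snr = speech_db - masker_db
    # Better-ear selection: keep the more favourable ear in every cell.
    better_ear_snr = local_snr.max(axis=0)
    # A cell counts as a glimpse when its local SNR exceeds the threshold.
    glimpses = better_ear_snr > threshold_db
    return float(glimpses.mean())


# Toy example: 2 bands x 2 frames, speech at 0 dB everywhere.
speech = np.zeros((2, 2, 2))
masker = np.array([
    [[-10.0, 10.0], [10.0, 10.0]],   # left ear: one favourable cell
    [[10.0, -10.0], [10.0, 10.0]],   # right ear: a different favourable cell
])
print(better_ear_glimpse_proportion(speech, masker))  # 0.5
```

In the toy example each ear alone offers one glimpsed cell out of four, while selecting the better ear per cell yields two out of four, which is the benefit that better-ear glimpsing is meant to capture.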

Highlights

  • Speech output, both natural and synthetic, is increasingly used in applications such as spoken dialogue systems, broadcast audio, and in public address systems

  • Traditional channel-based broadcasting systems are being gradually challenged by object-based systems, which have greater flexibility for sound production and can provide better transplantability to audio products

  • The current study describes an objective intelligibility metric (OIM) designed to estimate the intelligibility of speech sources in binaurally presented sound scenes



Introduction

Speech output, both natural and synthetic, is increasingly used in applications such as spoken dialogue systems, broadcast audio, and public address systems. Within many object-based audio systems, information about the spatial configuration of the target speech source and potential maskers is available as a parameter of the design process. In broadcast audio applications where dialogue is involved (e.g., Sonnenscheinn, 2001; Mapp, 2008), a sound editor may wish to know the approximate speech intelligibility of the

a) A preliminary version of part of this work was presented in “A glimpse-based approach for predicting binaural intelligibility with single and multiple maskers in anechoic conditions,” Proceedings of INTERSPEECH, Dresden, Germany, September 2015.

