Abstract

This paper describes novel methods for evoking a multichannel audio experience over stereo headphones. In contrast to the conventional convolution-based approach where, for example, five input channels are filtered using ten head-related transfer functions, the current approach is based on a parametric representation of the multichannel signal, along with either a parametric representation of the head-related transfer functions or a reduced set of head-related transfer functions. An audio scene with multiple virtual sound sources is represented by a mono or a stereo downmix signal of all sound source signals, accompanied by certain statistical (spatial) properties. These statistical properties of the sound sources are either combined with statistical properties of head-related transfer functions to estimate binaural parameters that represent the perceptually relevant aspects of the auditory scene, or used to create a limited set of combined head-related transfer functions that can be applied directly to the downmix signal. Subsequently, a binaural rendering stage reinstates the statistical properties of the sound sources by applying the estimated binaural parameters or the reduced set of combined head-related transfer functions directly to the downmix. When combined with parametric multichannel audio coders such as MPEG Surround, the proposed methods are advantageous over conventional methods in terms of perceived quality and computational complexity.
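
As a rough sketch of the parametric path described above, the following Python fragment estimates per-band binaural parameters (interaural level difference, interaural phase difference, and interaural coherence) by combining per-channel signal powers with average HRTF magnitudes and phases. The function name and parameterization are illustrative rather than the paper's exact formulation, and the derivation assumes mutually uncorrelated channel signals.

```python
import numpy as np

def binaural_parameters(sigma2, m_l, m_r, phi):
    """Estimate per-band binaural cues from channel powers and HRTF parameters.

    sigma2 : (C,) per-channel signal power in this frequency band
    m_l    : (C,) average HRTF magnitude from each channel to the left ear
    m_r    : (C,) average HRTF magnitude from each channel to the right ear
    phi    : (C,) average interaural phase difference of each channel's HRTF pair

    Assumes the channel signals are mutually uncorrelated, so powers add.
    """
    p_l = np.sum(sigma2 * m_l ** 2)                         # power reaching the left ear
    p_r = np.sum(sigma2 * m_r ** 2)                         # power reaching the right ear
    cross = np.sum(sigma2 * m_l * m_r * np.exp(1j * phi))   # cross-spectral term
    ild = 10.0 * np.log10(p_l / p_r)                        # interaural level difference (dB)
    ipd = np.angle(cross)                                   # interaural phase difference (rad)
    ic = np.abs(cross) / np.sqrt(p_l * p_r)                 # interaural coherence (0..1)
    return ild, ipd, ic
```

A binaural rendering stage could then reinstate these cues on the downmix, for example via per-band gains and phase rotations.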

Highlights

  • The synthesis of virtual auditory scenes has been an ongoing research topic for many years [1,2,3,4,5]

  • The basic principle is to combine the original set of head-related transfer functions (HRTFs) or binaural room transfer functions (BRTFs) into a limited set of four impulse responses that can be applied directly on the stereo downmix

  • The proposed method is beneficial since it operates on only four filters, as opposed to the ten filters normally used for binaural rendering of a five-channel signal, and it enables the use of echoic impulse responses (BRIRs)

Introduction

The synthesis of virtual auditory scenes has been an ongoing research topic for many years [1,2,3,4,5]. In the case of echoic impulse responses (so-called binaural room impulse responses (BRIRs), or binaural room transfer functions (BRTFs)), the parametric approach is not capable of accurately modeling all relevant perceptual aspects. In this case, a less compact HRTF or BRTF representation can be obtained by extending the 2×2 processing matrix in the time domain (i.e., having multiple “taps”). The basic principle is to combine the original set of HRTFs or BRTFs into a limited set of four impulse responses that can be applied directly on the stereo downmix. This is feasible when a representation of the original multichannel signal is available that relies on a stereo downmix and a set of spatial parameters, as is the case for MPEG Surround. It has been shown that at low frequencies, interaural time differences (ITDs) dominate sound source localization, while at high frequencies, interaural level differences (ILDs) and spectral cues dominate.
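
The combination step itself is straightforward to sketch. Assuming per-channel HRIR pairs and per-channel downmix gains derived from the transmitted spatial parameters (the names below are hypothetical, and the real system would update the gains per time/frequency tile rather than globally), the ten original filters collapse into four combined filters that are applied directly on the stereo downmix:

```python
import numpy as np
from scipy.signal import fftconvolve

def combine_hrirs(hrir_l, hrir_r, g_left, g_right):
    """Collapse C HRIR pairs into four combined filters for a stereo downmix.

    hrir_l, hrir_r  : (C, N) impulse responses from channel c to the left/right ear
    g_left, g_right : (C,)   contribution of each channel to the L/R downmix,
                             derived from the spatial parameters (illustrative)
    """
    h_ll = np.tensordot(g_left, hrir_l, axes=1)   # downmix L -> left ear
    h_lr = np.tensordot(g_left, hrir_r, axes=1)   # downmix L -> right ear
    h_rl = np.tensordot(g_right, hrir_l, axes=1)  # downmix R -> left ear
    h_rr = np.tensordot(g_right, hrir_r, axes=1)  # downmix R -> right ear
    return h_ll, h_lr, h_rl, h_rr

def render_binaural(dmx_l, dmx_r, h_ll, h_lr, h_rl, h_rr):
    """Apply the four combined filters directly on the stereo downmix."""
    out_l = fftconvolve(dmx_l, h_ll) + fftconvolve(dmx_r, h_rl)
    out_r = fftconvolve(dmx_l, h_lr) + fftconvolve(dmx_r, h_rr)
    return out_l, out_r
```

Because only four convolutions run regardless of the number of virtual channels, the complexity advantage over the conventional ten-filter approach follows directly, and nothing prevents the combined filters from being long, echoic BRIRs.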
