Abstract

Binaural Ambisonic rendering is widely used in immersive applications such as virtual reality due to its sound field rotation capabilities. Binaural Ambisonic reproduction can theoretically replicate the original sound field exactly for frequencies up to what is commonly referred to as the 'spatial aliasing frequency', f_alias. At frequencies above f_alias, however, reproduction can become inaccurate due to the limited spatial accuracy of reproducing a physical sound field with a finite number of transducers, which in practice causes localisation blur, reduced lateralisation and comb filtering spectral artefacts. The standard approach to improving Ambisonic reproduction is to increase the Ambisonic order, which allows exact sound field reproduction up to a higher f_alias, though at the expense of more channels for storage, more microphone capsules for recording, and more convolutions in binaural reproduction. It is therefore highly desirable to explore alternative methods of improving low-order Ambisonic rendering. One common practice is to employ a dual-band decoder with basic Ambisonic decoding at low frequencies and Max r_E channel weighting above f_alias, which improves spectral, localisation and lateralisation reproduction.

Virtual loudspeaker binaural Ambisonic decoders can be made by multiplying each loudspeaker's head-related impulse responses (HRIRs) by the decode matrix coefficients and summing the resulting spherical harmonic (SH) channels. This approach allows for dual-band decoding and loudspeaker configurations with more loudspeakers than SH channels whilst minimising the required number of convolutions. Binaural Ambisonic reproduction using the virtual loudspeaker approach is then achieved by summing the direct convolution of each SH channel of the encoded signal with the corresponding SH channel of the binaural decoder (a minimal sketch of this procedure is given below).

This paper presents the method and results of a perceptual comparison of state-of-the-art pre-processing techniques for virtual loudspeaker binaural Ambisonic rendering. By implementing these pre-processing techniques in the HRTFs used in the virtual loudspeaker binaural rendering stage, improvements can be made to the rendering. All pre-processing techniques are implemented offline, such that the resulting binaural decoders are of the same size and require the same number of real-time convolutions. The three pre-processing techniques investigated in this study are:

- Diffuse-field Equalisation (DFE)
- Ambisonic Interaural Level Difference Optimisation (AIO)
- Time Alignment (TA)

DFE is the removal of direction-independent spectral artefacts in the Ambisonic diffuse field. AIO augments the gains of the left and right virtual loudspeaker HRTF signals above f_alias such that Ambisonic renders produce more accurate interaural level differences (ILDs). TA is the removal of interaural time differences (ITDs) between the HRTFs above f_alias to reduce high frequency comb filtering effects.

The test follows the multiple stimulus with hidden reference and anchors (MUSHRA) paradigm, ITU-R BS.1534-3. Tests are conducted in a quiet listening room using a single set of Sennheiser HD 650 circum-aural headphones and an Apple MacBook Pro with a Fireface UCX audio interface, which has software-controlled input and output levels. Headphones are equalised from the RMS average of 11 impulse response measurements, with 1-octave band smoothing in the inverse filter. All audio is 24-bit depth and 48 kHz sample rate.
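The following minimal sketch illustrates the virtual loudspeaker decoder construction and SH-domain rendering described above. It is an assumption for illustration only, not the authors' implementation: the NumPy/SciPy usage, array shapes, decode matrix `decode_matrix` and function names are hypothetical, and dual-band decoding, Max r_E weighting and the offline HRTF pre-processing are omitted.

```python
# Illustrative sketch (assumptions, not the authors' code) of virtual
# loudspeaker binaural Ambisonic decoding in the spherical harmonic (SH)
# domain, assuming a pre-computed decode matrix of shape (L, K) for
# L loudspeakers and K = (M + 1)^2 SH channels, and HRIRs of shape (L, 2, T).
import numpy as np
from scipy.signal import fftconvolve

def build_sh_binaural_decoder(hrirs, decode_matrix):
    """Weight each loudspeaker's HRIR pair by its decode coefficients and
    sum over loudspeakers, giving one stereo filter per SH channel.
    hrirs:         (L, 2, T) loudspeaker HRIRs (left/right ears)
    decode_matrix: (L, K)    Ambisonic decode matrix
    returns:       (K, 2, T) SH-domain binaural decoder
    """
    # Sum over loudspeakers l: decode_matrix[l, k] * hrirs[l, e, t]
    return np.einsum('lk,let->ket', decode_matrix, hrirs)

def render_binaural(sh_signal, sh_decoder):
    """Convolve each SH channel of the encoded signal with the matching
    SH channel of the binaural decoder and sum, per ear.
    sh_signal:  (K, N)    Ambisonic (SH-encoded) programme material
    sh_decoder: (K, 2, T) SH-domain binaural decoder
    returns:    (2, N + T - 1) binaural output
    """
    out = np.zeros((2, sh_signal.shape[1] + sh_decoder.shape[2] - 1))
    for k in range(sh_signal.shape[0]):
        for ear in range(2):
            out[ear] += fftconvolve(sh_signal[k], sh_decoder[k, ear])
    return out
```

Because the decoder is stored per SH channel, the number of real-time convolutions depends only on the SH channel count, not on the number of virtual loudspeakers; this is the property that the offline pre-processing described above preserves.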
Listening tests are conducted using first, third and fifth order Ambisonics, with respective loudspeaker configurations comprising 6, 26 and 50 loudspeakers arranged in Lebedev grids. The different test conditions are made up of various combinations of the three pre-processing techniques. The test conditions are as follows:

1. HRTF convolution (reference)
2. Standard Ambisonic (dual band)
3. Ambisonic with DFE (dual band)
4. Ambisonic with AIO (dual band)
5. Ambisonic with AIO & DFE (dual band)
6. Ambisonic with TA & DFE (basic)
7. Ambisonic with TA & AIO & DFE (basic)
8. Ambisonic with TA & AIO & DFE (dual band)

The stimuli are synthesised complex acoustic scenes, defined in this paper as acoustic scenes with multiple sources. The synthesised complex scene used in this paper is composed from 24 freely available stems of an orchestra. Instruments are isolated and empirically matched in loudness. The orchestral stems are panned to the vertices of a 24-point T-design arrangement to ensure minimal overlap between the virtual loudspeaker positions in the binaural decoders and the sound sources in the complex scene. Synthesising complex scenes in this way allows for an explicit target reference stimulus, in this case a direct HRTF-convolved render (see the sketch below). If the Ambisonic stimuli are perfectly reconstructed, they will be equivalent to the reference stimulus. Results are analysed using non-parametric statistics and discussed in the full manuscript. The conclusion suggests the perceptually preferred pre-processing algorithms for virtual loudspeaker binaural Ambisonic rendering.
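As a hedged illustration of the scene synthesis and explicit reference described above (not the authors' code), the sketch below encodes loudness-matched stems at their T-design directions into the SH domain and renders the reference by direct HRIR convolution. The helper `real_sh_vector`, the array shapes and the function names are assumptions; the real-SH convention (e.g. ACN/SN3D) is left to whichever Ambisonic library supplies the encoding gains.

```python
# Illustrative sketch (assumptions, not the authors' code): synthesising a
# complex scene by Ambisonic-encoding 24 stems at 24-point T-design
# directions, and building the explicit reference by direct HRIR convolution.
import numpy as np
from scipy.signal import fftconvolve

def encode_scene(stems, directions, order, real_sh_vector):
    """stems:      (S, N) loudness-matched mono stems
       directions: (S, 2) (azimuth, elevation) in radians, T-design vertices
       order:      Ambisonic order M, giving K = (M + 1)^2 SH channels
       real_sh_vector: callable (order, azi, ele) -> (K,) real-SH encoding
                       gains, supplied by the Ambisonic library in use
       returns:    (K, N) SH-encoded complex scene
    """
    K = (order + 1) ** 2
    scene = np.zeros((K, stems.shape[1]))
    for s, (azi, ele) in enumerate(directions):
        scene += np.outer(real_sh_vector(order, azi, ele), stems[s])
    return scene

def render_reference(stems, hrirs):
    """Direct HRTF-convolved reference: convolve each stem with the HRIR
       pair measured at its T-design direction and sum per ear.
       stems: (S, N), hrirs: (S, 2, T) -> (2, N + T - 1)
    """
    ref = np.zeros((2, stems.shape[1] + hrirs.shape[2] - 1))
    for s in range(stems.shape[0]):
        for ear in range(2):
            ref[ear] += fftconvolve(stems[s], hrirs[s, ear])
    return ref
```

Under perfect reconstruction, passing the encoded scene through the SH-domain decoder sketched earlier would yield the same binaural signals as this reference, which is what the MUSHRA comparison probes.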
